[00:02:05] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:17] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:21] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-12 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:10:49] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-12 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:17:15] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:49] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS bullseye
[00:28:20] <wikibugs>	 (PS1) Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.109 [dns] - https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250)
[00:32:59] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-12 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:09] <wikibugs>	 (PS4) Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241)
[00:39:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2051.codfw.wmnet with reason: host reimage
[00:43:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2051.codfw.wmnet with reason: host reimage
[00:54:28] <wikibugs>	 (PS1) Tim Starling: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188)
[00:56:31] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit role and hiera settings for replica to gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250)
[00:56:33] <wikibugs>	 (PS1) Tim Starling: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 (https://phabricator.wikimedia.org/T296188)
[00:57:58] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:58:33] <wikibugs>	 (PS1) Tim Starling: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188)
[00:58:56] <wikibugs>	 (PS1) Tim Starling: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188)
[01:00:28] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2051.codfw.wmnet with OS bullseye
[01:01:15] <wikibugs>	 (PS1) Dzahn: acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250)
[01:04:49] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2052.codfw.wmnet with OS bullseye
[01:08:18] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support [puppet] - https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250)
[01:10:16] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-19 00:00:01 (3282 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:10:55] <wikibugs>	 (PS1) Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250)
[01:12:06] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 102.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[01:12:21] <wikibugs>	 (CR) Dzahn: "this goes into /var/lib/gerrit2 on gerrit1001. that's the actual home dir" [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[01:15:46] <wikibugs>	 (PS1) Dzahn: gerrit: add hiera data for a second replica [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250)
[01:24:42] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[01:24:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage
[01:26:05] <wikibugs>	 (CR) Dzahn: [C: -1] "this is not ready yet but I wanted to list it for tomorrow's meeting" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[01:27:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage
[01:27:17] <wikibugs>	 (CR) Dzahn: [C: -2] "can't be merged before we have the IP in netbox and DNS" [puppet] - https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[01:28:08] <wikibugs>	 (CR) Dzahn: [C: +1] "This can go first to get things out of the way I suppose." [puppet] - https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[01:28:30] <wikibugs>	 (CR) Dzahn: "I should also give you shell to gerrit2002..." [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[01:31:35] <wikibugs>	 (PS1) Dzahn: admin/gerrit: add gerrit shell admins on gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250)
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:20] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[01:44:42] <wikibugs>	 (PS2) Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250)
[01:49:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2052.codfw.wmnet with OS bullseye
[01:53:03] <wikibugs>	 (CR) Tim Starling: [C: +2] Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[01:53:09] <wikibugs>	 (CR) Tim Starling: [C: +2] Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[01:57:22] <icinga-wm>	 PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:24] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[02:04:28] <wikibugs>	 (PS1) Tim Starling: Switch testwiki to multi-DC active/active mode [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664)
[02:10:13] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:11:43] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:12:15] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:13:35] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-19 00:00:01 (3261 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:19:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:23:57] <icinga-wm>	 RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:13] <icinga-wm>	 RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:25:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:25:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:27:15] <icinga-wm>	 PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:31:44] <wikibugs>	 (CR) Tim Starling: [C: +2] Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:31:47] <wikibugs>	 (CR) Tim Starling: [C: +2] Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:41:36] <TimStarling>	 I am doing this merge and deployment for Winston_Sung[m], following the discussion in this channel last night my time
[02:42:15] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:44:47] <TimStarling>	 because I read the task comments with the "confusion, concern and shock" and all that
[02:46:30] <wikibugs>	 (Merged) jenkins-bot: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:47:40] <TimStarling>	 not saying I know what the big deal is, I know spoken Cantonese is quite distant from Mandarin but I thought the written languages were pretty close?
[02:48:05] <wikibugs>	 (Merged) jenkins-bot: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling)
[02:48:58] <Winston_Sung[m]>	 > I am doing this merge and deployment for Winston_Sung, following the discussion in this channel last night my time
[02:48:58] <Winston_Sung[m]>	 Thanks for the help.
[02:49:54] <Winston_Sung[m]>	 The biggest issue is that they don't want to see the Simplified Han script on the wiki.
[02:51:12] <logmsgbot>	 !log tstarling@deploy1002 Started scap: revert yue -> zh fallback, needs LC rebuild in both branches T296188
[02:51:14] <Winston_Sung[m]>	 And because the fallback chain was updated to zh and zh-hans, Tech News pushed the version that contains Simplified Han script.
[02:51:15] <stashbot>	 T296188: Clean up, merge, update zh/zh-* translations and update zh-related language fallback chains in mediawiki/core - https://phabricator.wikimedia.org/T296188
[02:51:58] <TimStarling>	 right
[02:53:08] <Winston_Sung[m]>	 They are strongly opposed to having Simplified Han script content.
[02:53:43] <Winston_Sung[m]>	 So the fallback chain update for yue needs more discussion.
[02:54:26] <TimStarling>	 got it
[02:54:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:58:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:58:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:59:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:10:53] <logmsgbot>	 !log tstarling@deploy1002 Finished scap: revert yue -> zh fallback, needs LC rebuild in both branches T296188 (duration: 19m 41s)
[03:10:58] <stashbot>	 T296188: Clean up, merge, update zh/zh-* translations and update zh-related language fallback chains in mediawiki/core - https://phabricator.wikimedia.org/T296188
[03:15:01] <TimStarling>	 OK, that worked, I tested this special page alias before and after: https://zh-yue.wikipedia.org/wiki/Special:%E6%89%80%E6%9C%89%E9%A1%B5%E9%9D%A2
[03:15:31] <TimStarling>	 now 404, previously it was a redirect
[03:16:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:16:25] <icinga-wm>	 PROBLEM - Host dns1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:16:25] <icinga-wm>	 PROBLEM - Host authdns1001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:16:25] <icinga-wm>	 PROBLEM - Host logstash1011 is DOWN: PING CRITICAL - Packet loss = 100%
[03:16:25] <icinga-wm>	 PROBLEM - Host bast4003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:38] <icinga-wm>	 PROBLEM - Host kubemaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:50] <icinga-wm>	 PROBLEM - Host wcqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:02] <icinga-wm>	 PROBLEM - Host kubernetes1012 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:04] <icinga-wm>	 PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:08] <icinga-wm>	 PROBLEM - Host wtp1038 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:08] <icinga-wm>	 PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:08] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:18:16] <icinga-wm>	 PROBLEM - Host db1146 #page is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:18] <icinga-wm>	 PROBLEM - Host wtp1039 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:33] <icinga-wm>	 PROBLEM - Host pc1013 #page is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:37] <icinga-wm>	 PROBLEM - Host db1120 #page is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:52] <icinga-wm>	 PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100%
[03:18:52] <icinga-wm>	 PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:04] <icinga-wm>	 PROBLEM - Host dbproxy1021 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:04] <icinga-wm>	 PROBLEM - Host dbproxy1019 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:04] <icinga-wm>	 PROBLEM - Host mwdebug1001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:06] <icinga-wm>	 PROBLEM - Host dbproxy1020 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:06] <icinga-wm>	 PROBLEM - Host matomo1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:06] <icinga-wm>	 PROBLEM - Host logstash1025 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:10] <icinga-wm>	 PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:10] <icinga-wm>	 PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:15] <rzl>	 wuh oh
[03:19:16] <icinga-wm>	 PROBLEM - Host aqs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] <icinga-wm>	 PROBLEM - Host an-tool1007 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] <icinga-wm>	 PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] <icinga-wm>	 PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:21] <icinga-wm>	 PROBLEM - Host db1181 #page is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:25] <TheresNoTime>	 Blame Tim
[03:19:28] <icinga-wm>	 PROBLEM - Host dbproxy1018 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:36] <TheresNoTime>	  (joke)
[03:19:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye
[03:19:57] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2060.codfw.wmnet with OS bullseye
[03:19:59] <rzl>	 pretty sure I'm gonna blame a rack switch but let's see
[03:20:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 321 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:20:06] <icinga-wm>	 PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:56] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:21:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 on db1171 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1181.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:22:08] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 #page on db1127 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1181.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:22:09] <icinga-wm>	 PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:09] <icinga-wm>	 PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:14] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{acc
[03:22:14] <icinga-wm>	 }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:22:18] <Niharika>	 who took down meta?
[03:22:21] <icinga-wm>	 RECOVERY - Host mwdebug1001 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms
[03:22:21] <icinga-wm>	 RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms
[03:22:23] <icinga-wm>	 RECOVERY - Host db1181 #page is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[03:22:23] <icinga-wm>	 RECOVERY - Host doh1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[03:22:24] <rzl>	 hm, it's more than one rack
[03:22:25] <icinga-wm>	 RECOVERY - Host db1146 #page is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms
[03:22:26] <icinga-wm>	 RECOVERY - Host pc1013 #page is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[03:22:26] <icinga-wm>	 RECOVERY - Host ganeti1010 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[03:22:26] <icinga-wm>	 RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[03:22:31] <icinga-wm>	 RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[03:22:31] <icinga-wm>	 RECOVERY - Host db1145 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[03:22:31] <icinga-wm>	 RECOVERY - Host wtp1038 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[03:22:33] <icinga-wm>	 RECOVERY - Host dbproxy1021 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[03:22:33] <icinga-wm>	 RECOVERY - Host aqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:33] <icinga-wm>	 RECOVERY - Host wtp1039 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:33] <icinga-wm>	 RECOVERY - Host dbproxy1018 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[03:22:35] <icinga-wm>	 RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[03:22:35] <icinga-wm>	 RECOVERY - Host wcqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[03:22:35] <icinga-wm>	 RECOVERY - Host dbproxy1019 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[03:22:37] <icinga-wm>	 RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[03:22:38] <rzl>	 definitely not out of the woods yet, still looking
[03:22:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:22:41] <icinga-wm>	 RECOVERY - Host kubemaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[03:22:41] <icinga-wm>	 RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[03:22:41] <icinga-wm>	 RECOVERY - Host kubernetes1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:41] <icinga-wm>	 RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[03:22:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-a-s4_3314: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s
[03:22:43] <icinga-wm>	 Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s2_3312: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s3_3313: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet 
[03:22:43] <icinga-wm>	 ed down but pooled: wikireplicas-a-s5_3315: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but poole https://wikitech.wikimedia.org/wiki/PyBal
[03:22:43] <icinga-wm>	 RECOVERY - Host ganeti1024 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[03:22:46] <icinga-wm>	 RECOVERY - Host es1022 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:49] <icinga-wm>	 RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[03:22:49] <icinga-wm>	 RECOVERY - Host db1120 #page is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[03:22:50] <icinga-wm>	 RECOVERY - Host db1169 #page is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[03:22:51] <icinga-wm>	 RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[03:22:54] <icinga-wm>	 RECOVERY - Host db1168 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:54] <icinga-wm>	 RECOVERY - Host logstash1025 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[03:23:01] <icinga-wm>	 RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[03:23:03] <icinga-wm>	 RECOVERY - Host dbproxy1020 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[03:23:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:23:13] <icinga-wm>	 RECOVERY - Host matomo1002 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[03:23:29] <icinga-wm>	 RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[03:23:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 508 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:23:51] <jinxer-wm>	 (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:23:56] <jinxer-wm>	 (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:24:19] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on db1171 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:24:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s
[03:24:35] <icinga-wm>	 Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s3_3313: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s5_3315: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eq
[03:24:35] <icinga-wm>	 t are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s3_3313: Servers dbproxy1018.eqiad.wmnet are marked down https://wikitech.wikimedia.org/wiki/PyBal
[03:24:51] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:24:52] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 #page on db1127 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:25:11] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 13 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:25:40] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:41] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:25:51] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[03:26:06] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:26:13] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:26:37] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:26:41] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:26:45] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:26:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:26:54] <Winston_Sung[m]>	 <TimStarling> "OK, that worked, I tested this..." <- Thanks.
[03:27:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:27:41] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[03:28:21] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[03:28:33] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy
[03:28:33] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[03:29:19] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy
[03:29:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:30:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:30:17] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:28] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[03:33:17] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[03:36:47] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[03:37:27] <rzl>	 !log rzl@dbproxy1018:~$ sudo systemctl reload haproxy
[03:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:39:08] <jinxer-wm>	 (ProbeDown) firing: (3) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown  - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:09] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-19 00:00:01 (3261 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:39:12] <jinxer-wm>	 (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:18] <wikibugs>	 (PS2) KartikMistry: Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384)
[03:39:21] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[03:39:23] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:43:59] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:44:04] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:44:14] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:44:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[03:44:34] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:46:55] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[03:47:57] <Kemayo>	 legoktm: Am I wrong to think that wikimediastatus.net isn't very informative on the subject of phab being down despite the topic's claims? It currently just says "all systems operational".
[03:48:01] <icinga-wm>	 RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:27] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[03:48:38] <rzl>	 !log rzl@cumin2002:~$ sudo cumin dbproxy[1019,1020,1021].eqiad.wmnet 'systemctl reload haproxy'
[03:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:41] <legoktm>	 Kemayo: right, the main status page is for wikis, not supporting services
[03:49:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[03:49:19] <rzl>	 phab restored 👍
[03:49:27] <Kemayo>	 Which is fine -- it just feels weird to reference it in the topic as such. :D
[03:50:52] <legoktm>	 yeah, I probably should've thrown in a few more words there, like "for wikis see..."
[03:51:27] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[03:51:46] <jinxer-wm>	 (ProbeDown) resolved: (3) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown  - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:53:56] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[03:54:30] <rzl>	 logstash: I know, buddy, I know <3
[03:55:33] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[03:56:06] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:56:19] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[03:56:50] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[04:00:39] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:05:23] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:13] <rzl>	 !log rzl@kubemaster1002:~$ sudo systemctl restart kube-apiserver
[04:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:07] <rzl>	 !log rzl@kubemaster1001:~$ sudo systemctl restart kube-apiserver
[04:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:01] <icinga-wm>	 PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100%
[04:14:21] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:21:09] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[04:43:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[04:43:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[04:43:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[04:47:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[04:47:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[04:47:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31471 and previous config saved to /var/cache/conftool/dbconfig/20220720-044729-marostegui.json
[04:47:33] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[04:50:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31472 and previous config saved to /var/cache/conftool/dbconfig/20220720-045004-marostegui.json
[04:50:26] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2168 to dbctl [puppet] - https://gerrit.wikimedia.org/r/815427 (https://phabricator.wikimedia.org/T311493)
[04:54:14] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2168 to dbctl [puppet] - https://gerrit.wikimedia.org/r/815427 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[04:57:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[04:57:59] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[04:59:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2168 to dbctl in s7 and s8 T311493', diff saved to https://phabricator.wikimedia.org/P31473 and previous config saved to /var/cache/conftool/dbconfig/20220720-045918-marostegui.json
[04:59:22] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:00:28] <wikibugs>	 (PS1) Marostegui: db2168: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/815428 (https://phabricator.wikimedia.org/T311493)
[05:05:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31474 and previous config saved to /var/cache/conftool/dbconfig/20220720-050509-marostegui.json
[05:06:35] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:09:49] <wikibugs>	 (CR) Marostegui: [C: +2] db2168: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/815428 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:13:25] <wikibugs>	 (PS1) Marostegui: site.pp: Remove insetup from db2167,db2168 [puppet] - https://gerrit.wikimedia.org/r/815430 (https://phabricator.wikimedia.org/T311493)
[05:14:46] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove insetup from db2167,db2168 [puppet] - https://gerrit.wikimedia.org/r/815430 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:20:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31475 and previous config saved to /var/cache/conftool/dbconfig/20220720-052014-marostegui.json
[05:26:34] <marostegui>	 !log Stop mysql on db2087 (s6 and s7) to clone db2169 T311493
[05:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:38] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:27:41] <wikibugs>	 (CR) Krinkle: [C: +1] Switch testwiki to multi-DC active/active mode [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[05:28:03] <wikibugs>	 (PS1) Marostegui: db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815516 (https://phabricator.wikimedia.org/T311493)
[05:29:06] <wikibugs>	 (CR) Marostegui: [C: +2] db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815516 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:35:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31478 and previous config saved to /var/cache/conftool/dbconfig/20220720-053520-marostegui.json
[05:35:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:35:24] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[05:35:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:36:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[05:36:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[05:36:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31479 and previous config saved to /var/cache/conftool/dbconfig/20220720-053620-marostegui.json
[05:37:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31480 and previous config saved to /var/cache/conftool/dbconfig/20220720-053751-marostegui.json
[05:40:39] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20
[05:40:39] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:44:43] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:52:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31481 and previous config saved to /var/cache/conftool/dbconfig/20220720-055256-marostegui.json
[05:56:25] <icinga-wm>	 PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:09] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20
[06:03:09] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:08:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31482 and previous config saved to /var/cache/conftool/dbconfig/20220720-060802-marostegui.json
[06:21:47] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31483 and previous config saved to /var/cache/conftool/dbconfig/20220720-062307-marostegui.json
[06:23:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[06:23:13] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[06:23:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[06:23:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31484 and previous config saved to /var/cache/conftool/dbconfig/20220720-062327-marostegui.json
[06:25:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31485 and previous config saved to /var/cache/conftool/dbconfig/20220720-062539-marostegui.json
[06:28:14] <wikibugs>	 (PS1) PleaseStand: SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753)
[06:29:30] <wikibugs>	 (PS1) PleaseStand: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753)
[06:30:18] <wikibugs>	 (PS1) PleaseStand: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753)
[06:40:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31486 and previous config saved to /var/cache/conftool/dbconfig/20220720-064044-marostegui.json
[06:41:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[06:41:21] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[06:41:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[06:43:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2029.codfw.wmnet with OS bullseye
[06:43:40] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2029.codfw.wmnet with OS bullseye
[06:55:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31487 and previous config saved to /var/cache/conftool/dbconfig/20220720-065549-marostegui.json
[06:57:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2029.codfw.wmnet with reason: host reimage
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T0700). Please do the needful.
[07:00:05] <jouncebot>	 Sohom_Datta, kart_, PleaseStand, PleaseStand, and PleaseStand: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:38] * kart_ is here. Sorry for delay
[07:02:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2029.codfw.wmnet with reason: host reimage
[07:03:14] <PleaseStand>	 Amir1: hi
[07:03:20] <Sohom_Datta>	 I'm here :)
[07:03:48] <kart_>	 cool. Amir1 urbanecm Are you doing deployments?
[07:05:30] <kart_>	 OK. I can quickly deploy my change first while we are waiting for Amir1 / urbanecm 
[07:06:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nicely done! LGTM" [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[07:06:58] <wikibugs>	 (CR) KartikMistry: [C: +2] Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:07:55] <wikibugs>	 (Merged) jenkins-bot: Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:09:12] <taavi>	 o/ I can deploy today
[07:09:45] <kart_>	 taavi: I'm deploying my change, will let you know once done.
[07:09:55] <taavi>	 sure
[07:10:54] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2169 [puppet] - https://gerrit.wikimedia.org/r/815677 (https://phabricator.wikimedia.org/T311493)
[07:10:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31488 and previous config saved to /var/cache/conftool/dbconfig/20220720-071054-marostegui.json
[07:10:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:10:59] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:11:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:11:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31489 and previous config saved to /var/cache/conftool/dbconfig/20220720-071114-marostegui.json
[07:12:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:14:53] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815251|Enable ContentTranslation out of Beta for sswiki (T309384)]] (duration: 03m 24s)
[07:14:57] <stashbot>	 T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384
[07:15:40] <kart_>	 taavi: I'm done.
[07:16:19] <wikibugs>	 (PS5) Majavah: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:16:41] <wikibugs>	 (CR) Majavah: [C: +2] Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:17:10] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (Vgutierrez) Open→Resolved a:Vgutierrez DigitalOcean restored the CSV and it's now working as...
[07:17:16] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Vgutierrez)
[07:17:36] <wikibugs>	 (Merged) jenkins-bot: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[07:17:45] <wikibugs>	 (CR) JMeybohm: [C: +1] role::beta::docker_services: prune docker images [puppet] - https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: Ori)
[07:18:25] <taavi>	 Sohom_Datta: merged your patch, it'll be automatically deployed to the beta cluster in the next 30 mins or so, ping me if it doesn't
[07:18:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2029.codfw.wmnet with OS bullseye
[07:18:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2029.codfw.wmnet with OS bullseye completed: - ganeti2029 (**PASS**)   - Downtimed on...
[07:18:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:18:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:19:05] <wikibugs>	 SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (fgiunchedi) >>! In T313102#8088079, @MatthewVernon wrote: > Are there other teams you think we should talk to before turning this off, then?  Indeed, I know @hnowla...
[07:19:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31490 and previous config saved to /var/cache/conftool/dbconfig/20220720-071927-marostegui.json
[07:19:28] <wikibugs>	 (CR) Majavah: [C: +2] SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753) (owner: PleaseStand)
[07:19:30] <wikibugs>	 (CR) Majavah: [C: +2] populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753) (owner: PleaseStand)
[07:19:31] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[07:19:32] <wikibugs>	 (CR) Majavah: [C: +2] populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753) (owner: PleaseStand)
[07:19:54] <taavi>	 PleaseStand: I'm guessing your patches can't really be tested?
[07:20:30] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[07:20:37] <Sohom_Datta>	 Thanks a bunch, will let you know :)
[07:20:49] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Vgutierrez) >>! In T313213#8089900, @AAlikhan wrote: > I'm approving this request for @soworu. Let me know if there's anything beyond this comment that I need to do to suppor...
[07:21:10] <PleaseStand>	 taavi: I don't have production shell access, and probably don't have beta cluster shell access either
[07:21:49] <wikibugs>	 (03Merged) 10jenkins-bot: SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand)
[07:21:51] <wikibugs>	 (03Merged) 10jenkins-bot: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand)
[07:21:58] <wikibugs>	 (03Merged) 10jenkins-bot: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand)
[07:22:09] <taavi>	 I know, I'm asking if there's anything that needs to be done to your patches before I sync them to the prod cluster
[07:22:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:23:11] <PleaseStand>	 taavi: Should be OK, it's only a maintenance script that would be run manually, probably by foks
[07:23:48] <taavi>	 ok, thanks
[07:23:53] <foks>	 yup that is correct
[07:24:57] <wikibugs>	 (03CR) 10David Caro: wmcs: don't page for most checks (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro)
[07:25:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable x509 CN validation in blackbox (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:26:03] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: enable x509 CN validation in blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847)
[07:26:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:26:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:26:58] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/SecurePoll/cli/wm-scripts/bv2022/: T309753 backports (duration: 02m 57s)
[07:27:02] <stashbot>	 T309753: Create SecurePoll voter list for 2022 board vote - https://phabricator.wikimedia.org/T309753
[07:27:29] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382)
[07:27:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:28:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:28:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:29:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:30:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) Critical DB infra there:  - dbproxy1020 (m3 current proxy): needs failover. - pc1013 active pc3 master: needs failover - db1181 s7 master: needs failover T313383...
[07:30:22] <jayme>	 !log kubernetes1010.eqiad.wmnet,kubernetes1020.eqiad.wmnet 'systemctl restart rsyslog'
[07:30:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:40] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/SecurePoll/cli/wm-scripts/bv2022/populateEditCount.php: T309753 backports (duration: 02m 54s)
[07:30:53] <taavi>	 PleaseStand: ok, that should be everything synced
[07:30:59] <taavi>	 anyone have anything else to deploy?
[07:31:28] <jayme>	 !log ml-serve1002.eqiad.wmnet,ml-serve1004.eqiad.wmnet 'systemctl restart rsyslog'
[07:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:45] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) p:05Triage→03High
[07:32:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2169 [puppet] - 10https://gerrit.wikimedia.org/r/815677 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[07:33:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui)
[07:33:37] <wikibugs>	 (03PS5) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661)
[07:34:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31491 and previous config saved to /var/cache/conftool/dbconfig/20220720-073432-marostegui.json
[07:34:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:35:23] <wikibugs>	 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10JMeybohm) 05Open→03Resolved
[07:35:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:35:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:35:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:35:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:36:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:39:23] <wikibugs>	 (03PS1) 10Majavah: P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681
[07:41:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use protocol in blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:41:18] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: use protocol in blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847)
[07:41:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[07:41:35] <wikibugs>	 (03PS2) 10Majavah: P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681
[07:42:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36315/console" [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah)
[07:42:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2] prometheus: use protocol in blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:44:06] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Vgutierrez) maybe @Ladsgroup would be a better fit to help here, but meanwhile, could you provide some details like which mailing list are you referring to? Thanks
[07:46:00] <wikibugs>	 (03PS1) 10Majavah: P:openstack::designate: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815683
[07:47:14] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36316/console" [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah)
[07:47:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[07:47:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) This didn't get caught by monitoring. We have a LibreNMS alert that triggers when any "emergency" log is sent by a device, but loo...
[07:49:21] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Peachey88)
[07:49:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31492 and previous config saved to /var/cache/conftool/dbconfig/20220720-074937-marostegui.json
[07:54:06] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382) (owner: 10Marostegui)
[07:59:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:59:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add blackbox TCP check [puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:59:32] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: add blackbox TCP check [puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847)
[08:00:05] <jouncebot>	 jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T0800).
[08:02:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:04:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:04:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31493 and previous config saved to /var/cache/conftool/dbconfig/20220720-080442-marostegui.json
[08:04:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:04:47] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:04:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:05:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:05:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:05:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31494 and previous config saved to /var/cache/conftool/dbconfig/20220720-080509-marostegui.json
[08:07:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31495 and previous config saved to /var/cache/conftool/dbconfig/20220720-080721-marostegui.json
[08:08:14] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:08:16] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:37] <wikibugs>	 (03CR) 10Volans: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond)
[08:10:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847)
[08:11:06] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[08:11:24] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Aklapper) 05Open→03Stalled Also, what exactly is an "id"? What does https://lists.wikimedia.org/user-profile/ say?
[08:12:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:12:41] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847)
[08:12:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847)
[08:12:45] <wikibugs>	 (03PS6) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847)
[08:12:47] <wikibugs>	 (03PS14) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815
[08:14:07] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:13] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi)
[08:14:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi)
[08:14:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi)
[08:14:54] <elukey>	 !log apt-get clean on archiva1002 to free some space
[08:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ayounsi)
[08:16:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ayounsi)
[08:19:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:22:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31496 and previous config saved to /var/cache/conftool/dbconfig/20220720-082226-marostegui.json
[08:23:01] <wikibugs>	 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10elukey)
[08:34:33] <wikibugs>	 (03PS7) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710)
[08:34:49] <wikibugs>	 (03CR) 10Ayounsi: Netbox _get_circuits: add patch panel support (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi)
[08:37:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31497 and previous config saved to /var/cache/conftool/dbconfig/20220720-083731-marostegui.json
[08:43:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Resolved→03Open Since the replacement errors rate on one of the interfaces went through the roof: https://librenms.wikimedia.org/graphs/to=1658306...
[08:48:24] <wikibugs>	 (03CR) 10Jbond: prometheus: enable x509 CN validation in blackbox (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:50:45] <wikibugs>	 (03CR) 10Jbond: prometheus: enable x509 CN validation in blackbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:52:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31498 and previous config saved to /var/cache/conftool/dbconfig/20220720-085236-marostegui.json
[08:52:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:52:43] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[08:52:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:52:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31499 and previous config saved to /var/cache/conftool/dbconfig/20220720-085256-marostegui.json
[09:00:28] <wikibugs>	 (03CR) 10Ayounsi: provision cookbook: configure switches using cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[09:03:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[09:03:38] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:17] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:07:32] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond)
[09:08:09] <wikibugs>	 (03CR) 10Jbond: redfish: add a fqdn getter property and __str__ method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond)
[09:08:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond)
[09:08:21] <wikibugs>	 (03PS8) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968
[09:09:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:09:26] <wikibugs>	 (03PS6) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969
[09:10:05] <wikibugs>	 (03PS1) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982)
[09:10:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey)
[09:13:59] <wikibugs>	 (03PS5) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970
[09:14:01] <wikibugs>	 (03PS5) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971
[09:14:29] <wikibugs>	 (03PS6) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970
[09:14:33] <wikibugs>	 (03PS6) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971
[09:14:51] <wikibugs>	 (03PS2) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982)
[09:15:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) Opened high severity JTAC case 2022-0720-513915. In the meantime we need to discuss if we want to preemptively replace FPC5 with a...
[09:15:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey)
[09:17:53] <wikibugs>	 (03CR) 10Jbond: remote: add an __iter__ to RemoteHosts (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond)
[09:19:27] <wikibugs>	 (03CR) 10Volans: "replies inline, LGTM otherwise" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond)
[09:21:43] <wikibugs>	 (03CR) 10Jbond: redfish: Add property for the HttpPushURI (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:21:47] <wikibugs>	 (03CR) 10Jbond: redfish: add a generation property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:22:13] <wikibugs>	 (03PS2) 10Jbond: remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243
[09:22:47] <wikibugs>	 (03PS2) 10Volans: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[09:26:29] <wikibugs>	 (03PS7) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971
[09:29:01] <wikibugs>	 (03CR) 10Jbond: redfish: add wait for reboot function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond)
[09:29:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:29:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:30:06] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[09:31:08] <hashar>	 jbond: XioNoX: hi, I am wondering whether we should move wikibugs notifications for homer/spicerack to another channel? :]
[09:31:46] <XioNoX>	 I /ignore wikibugs so dunno what you're talking about :)
[09:31:51] <hashar>	 ahah
[09:31:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:32:30] <XioNoX>	 (/cc volans)
[09:33:58] <wikibugs>	 (03PS7) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969
[09:34:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:34:10] <wikibugs>	 (03PS7) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970
[09:34:15] <wikibugs>	 (03PS8) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971
[09:35:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond)
[09:35:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:37:48] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond)
[09:41:23] <hashar>	 https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/wikibugs2/+/refs/heads/master/gerrit-channels.yaml#162 has a catch all for `operations/` repos
[09:41:45] <wikibugs>	 (03PS3) 10Volans: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[09:44:54] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:45:13] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond)
[09:46:26] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond)
[09:46:46] <wikibugs>	 (03PS2) 10David Caro: tests: Add nice message to runbook check test failure [alerts] - 10https://gerrit.wikimedia.org/r/815238
[09:46:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond)
[09:47:14] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond)
[09:47:16] <wikibugs>	 (03PS3) 10Jbond: remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243
[09:48:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[09:49:57] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s/reboot-nodes: Error if nodes are cordoned (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:52:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:52:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:52:39] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[09:53:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31501 and previous config saved to /var/cache/conftool/dbconfig/20220720-095310-marostegui.json
[09:53:14] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[09:53:24] <wikibugs>	 (03PS3) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982)
[09:54:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to cluster codfw and group A
[09:55:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2029.codfw.wmnet to cluster codfw and group A
[09:55:47] <wikibugs>	 (03CR) 10Volans: remote: add an __iter__ to RemoteHosts (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond)
[09:56:15] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond)
[09:56:17] <wikibugs>	 (03Merged) 10jenkins-bot: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:58:00] <wikibugs>	 (03CR) 10Volans: [C: 03+2] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[10:01:35] <wikibugs>	 (03CR) 10Jbond: "lgtm optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:02:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:02:53] <wikibugs>	 (03Merged) 10jenkins-bot: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi)
[10:03:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:04:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi)
[10:04:34] <wikibugs>	 (03PS3) 10Volans: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi)
[10:06:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[10:07:07] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[10:08:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31502 and previous config saved to /var/cache/conftool/dbconfig/20220720-100815-marostegui.json
[10:08:33] <wikibugs>	 (03PS4) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982)
[10:08:47] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi)
[10:09:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS bullseye
[10:09:16] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bullseye
[10:09:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi)
[10:12:52] <wikibugs>	 (03Merged) 10jenkins-bot: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi)
[10:13:13] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey)
[10:13:29] <wikibugs>	 (03PS8) 10Volans: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi)
[10:13:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:13:36] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:13:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:21:03] <wikibugs>	 (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694
[10:23:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31503 and previous config saved to /var/cache/conftool/dbconfig/20220720-102320-marostegui.json
[10:24:49] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847)
[10:24:51] <wikibugs>	 (03PS3) 10Filippo Giunchedi: syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847)
[10:24:53] <wikibugs>	 (03PS7) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847)
[10:24:55] <wikibugs>	 (03PS15) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815
[10:24:59] <wikibugs>	 (CR) Filippo Giunchedi: prometheus: adjust blackbox check params/types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[10:25:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage
[10:26:22] <wikibugs>	 (03CR) 10Volans: "LGTM, minor nits inline" [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi)
[10:27:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage
[10:30:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:30:17] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:30:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:31:23] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[10:31:27] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "thanks for the comment see in line, i have also -1 this as i remembered it is still a bit incomplete as it fails to mock ocsp response fil" [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond)
[10:31:50] <wikibugs>	 (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694
[10:32:13] <wikibugs>	 (03CR) 10Ayounsi: "Thanks!" [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi)
[10:34:03] <wikibugs>	 (03PS3) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694
[10:35:39] <wikibugs>	 (CR) Filippo Giunchedi: prometheus: enable x509 CN validation in blackbox (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[10:37:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond)
[10:38:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31504 and previous config saved to /var/cache/conftool/dbconfig/20220720-103825-marostegui.json
[10:38:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:38:31] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[10:38:39] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi)
[10:38:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:39:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[10:39:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[10:39:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:39:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance
[10:40:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance
[10:40:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 12 hosts with reason: Maintenance
[10:40:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 12 hosts with reason: Maintenance
[10:43:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS bullseye
[10:43:25] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bullseye completed: - ganeti2020 (**PASS**)   - Downtimed on...
[10:44:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi)
[10:45:18] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi)
[10:49:11] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696
[10:50:15] <wikibugs>	 (03PS41) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215
[10:51:22] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[10:51:56] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696 (owner: 10Volans)
[10:56:06] <wikibugs>	 (03PS1) 10Ayounsi: Release v0.5.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698
[10:57:23] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 4 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[10:57:25] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36317/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond)
[10:59:09] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698 (owner: 10Ayounsi)
[10:59:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[11:01:12] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond)
[11:02:38] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696 (owner: 10Volans)
[11:03:00] <moritzm>	 !log draining ganeti2014 T310483
[11:03:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:56] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Release v0.5.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698 (owner: 10Ayounsi)
[11:05:22] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.1 - ayounsi@cumin1001
[11:06:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.1 - ayounsi@cumin1001
[11:09:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[11:16:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382) (owner: 10Marostegui)
[11:17:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2009.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[11:17:27] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[11:17:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2009.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[11:17:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) m3-master dbproxy has been failed over.
[11:25:09] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[11:34:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[11:34:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:34:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:34:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31506 and previous config saved to /var/cache/conftool/dbconfig/20220720-113424-marostegui.json
[11:34:28] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[11:34:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10dcaro)
[11:35:07] <icinga-wm>	 RECOVERY - Check systemd state on parse2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:00] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) I tried flamegraph on the host again today.  With 500* concurrent threads where it basically never stops accepting conne...
[11:45:07] <wikibugs>	 (03CR) 10LSobanski: "Side question, what's the definition of "stable" that would prompt the move to -operations?" [puppet] - 10https://gerrit.wikimedia.org/r/814926 (owner: 10Dzahn)
[11:50:27] <wikibugs>	 10SRE, 10SRE-OnFire, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10LSobanski) Considering the plan to migrate away from miscweb, are there any reasons not to deploy this to K8s from the get go?
[11:52:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31507 and previous config saved to /var/cache/conftool/dbconfig/20220720-115233-marostegui.json
[11:52:38] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[11:54:09] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:58] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab_runner: Allow DNS requests from GitLab runner containers in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/812264 (https://phabricator.wikimedia.org/T311241) (owner: 10Jelto)
[12:07:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31509 and previous config saved to /var/cache/conftool/dbconfig/20220720-120738-marostegui.json
[12:13:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:13:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:14:17] <wikibugs>	 (03PS1) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400)
[12:15:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:17:37] <wikibugs>	 (03PS42) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215
[12:19:57] <icinga-wm>	 RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[12:22:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31510 and previous config saved to /var/cache/conftool/dbconfig/20220720-122246-marostegui.json
[12:25:40] <wikibugs>	 (03PS2) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400)
[12:26:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:26:45] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 (owner: 10Urbanecm)
[12:27:46] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: end mailing list campaign in beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: Sergio Gimeno)
[12:29:07] <marostegui>	 !log Move pc1014 from pc2 to pc3 T313401
[12:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:12] <stashbot>	 T313401: Move pc1014 from pc2 to pc3 - https://phabricator.wikimedia.org/T313401
[12:29:24] <wikibugs>	 (03PS3) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400)
[12:30:23] <wikibugs>	 (03PS1) 10Marostegui: pc1014: Move it from pc2 to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/815706 (https://phabricator.wikimedia.org/T313401)
[12:30:48] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36320/console" [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:32:58] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc1014: Move it from pc2 to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/815706 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui)
[12:33:26] <wikibugs>	 (03PS16) 10Filippo Giunchedi: mw_rc_irc: check ircd availability with blackbox prober [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847)
[12:36:46] <wikibugs>	 (03PS43) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215
[12:37:09] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "the PCC looks good, will deploy one-by-one" [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:37:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31511 and previous config saved to /var/cache/conftool/dbconfig/20220720-123751-marostegui.json
[12:37:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[12:37:58] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[12:38:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[12:39:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:39:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:40:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:40:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:40:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31512 and previous config saved to /var/cache/conftool/dbconfig/20220720-124042-marostegui.json
[12:41:34] <wikibugs>	 (03PS4) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400)
[12:43:26] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36321/console" [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:43:38] <wikibugs>	 (03PS44) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215
[12:44:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31513 and previous config saved to /var/cache/conftool/dbconfig/20220720-124453-marostegui.json
[12:45:01] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[12:46:38] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro)
[12:49:12] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover s7 master [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383)
[12:49:38] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui)
[12:50:19] <wikibugs>	 SRE, ops-codfw, DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (bking) a:Papaul→bking
[12:50:39] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10bking) I think there's a cookbook that can fix this, will grab the ticket and give it a shot.
[12:51:02] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383)
[12:52:43] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui)
[12:58:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover s7 master [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui)
[13:00:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31514 and previous config saved to /var/cache/conftool/dbconfig/20220720-130000-marostegui.json
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:21] <wikibugs>	 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10dcausse) We might perhaps be able to drop all wdqs artifacts prior to 0.3.40, this is the oldest reference I found here: https://github.com/wmde/wikibase-relea...
[13:00:21] <Lucas_WMDE>	 I think I’d like to backport a fix
[13:00:28] <Lucas_WMDE>	 but I could also use a break first ^^
[13:00:34] <Lucas_WMDE>	 so maybe a bit later in the window
[13:00:43] <Lucas_WMDE>	 (I’ll add it to the wikitech page if anything happens)
[13:03:11] <icinga-wm>	 PROBLEM - Check systemd state on restbase1026 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:23] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is CRITICAL: connect to address 10.64.48.182 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[13:03:31] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:04:09] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:04:13] <wikibugs>	 10SRE, 10SRE-OnFire, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10CDanis) filter_victorops_calendar requires some persistent storage, ideally a plain filesystem although we could figure out something els...
[13:06:49] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116)
[13:07:09] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116)
[13:07:38] <wikibugs>	 (CR) Ladsgroup: [C: -1] mariadb: Promote db1136 to s7 master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: Marostegui)
[13:08:05] <wikibugs>	 (CR) Marostegui: [C: -2] mariadb: Promote db1136 to s7 master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: Marostegui)
[13:08:27] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383)
[13:08:38] <wikibugs>	 (CR) Marostegui: mariadb: Promote db1136 to s7 master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: Marostegui)
[13:09:15] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:10:25] <icinga-wm>	 RECOVERY - Check systemd state on restbase1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:47] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:11:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:11:24] <Lucas_WMDE>	 alright, let’s start backporting
[13:11:25] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2023-04-14 11:21:30 +0000 (expires in 267 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:11:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(Sync src/ first, then WikibaseLexeme.resources.php.)" [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:12:41] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui)
[13:13:03] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.182 port 9042 https://phabricator.wikimedia.org/T93886
[13:15:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31515 and previous config saved to /var/cache/conftool/dbconfig/20220720-131505-marostegui.json
[13:15:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2034.codfw.wmnet with OS bullseye
[13:15:41] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2034.codfw.wmnet with OS bullseye
[13:17:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847)
[13:17:50] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10phaultfinder)
[13:18:26] * MichaelG_WMDE is also here and ready to support testing of that backport
[13:18:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36322/console" [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:19:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:19:10] <godog>	 taavi: ^
[13:20:12] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36323/console" [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:20:22] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Open→03Resolved Nevermind, tracked in T313337
[13:20:53] <taavi>	 godog: thanks! lgtm
[13:21:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:21:54] <godog>	 taavi: cheers, thanks for letting me know
[13:22:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] admin/gerrit: add gerrit shell admins on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[13:23:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715
[13:26:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add additional interpreters, drop 3.6 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto)
[13:27:14] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:27:28] <wikibugs>	 (03CR) 10Hashar: "I have asked Valentin to deploy this now so I can get access to gerrit2002.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[13:27:43] <wikibugs>	 (03Merged) 10jenkins-bot: Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:28:43] <Lucas_WMDE>	 okay, wmf.21 backport should be on mwdebug1001, let’s test it (cc MichaelG_WMDE)
[13:28:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 231, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:29:04] * MichaelG_WMDE looks
[13:29:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2034.codfw.wmnet with reason: host reimage
[13:30:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31516 and previous config saved to /var/cache/conftool/dbconfig/20220720-133010-marostegui.json
[13:30:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[13:30:14] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[13:30:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[13:30:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31517 and previous config saved to /var/cache/conftool/dbconfig/20220720-133030-marostegui.json
[13:31:10] <MichaelG_WMDE>	 works for me! @Lucas_WMDE 
[13:31:18] <Lucas_WMDE>	 same here, thanks for testing!
[13:33:08] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2034.codfw.wmnet with reason: host reimage
[13:33:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31518 and previous config saved to /var/cache/conftool/dbconfig/20220720-133336-marostegui.json
[13:33:50] <XioNoX>	 !log cr2-eqiad# deactivate interfaces xe-3/3/0 - 
[13:33:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:53] <XioNoX>	 !log cr2-eqiad# deactivate interfaces xe-3/3/0 - T313337
[13:33:56] <Lucas_WMDE>	 hmm, a scap warning about “restart-php-fpm-all … called with an empty host list”
[13:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:58] <stashbot>	 T313337: Inbound interface errors - https://phabricator.wikimedia.org/T313337
[13:34:39] <Lucas_WMDE>	 I guess that’s fine because the second sync in a moment will restart php-fpm again (and it looks like the file itself was synced)
[13:35:24] <moritzm>	 !log installing request-tracker4 security updates
[13:35:26] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10ayounsi) Looks like two interfaces are/were showing errors: cr2-eqiad:xe-3/0/3 - remote side seeing inbound errors: https://librenms.wikimedia.org/graphs/to=1658306400/id=12731/type=port_errors/from=1658133600/ I re-enabled the...
[13:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:35:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/WikibaseLexeme/src/MediaWiki/Config/LexemeLanguageCodePropertyIdConfig.php: Backport: [[gerrit:815425|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (1/2) (duration: 03m 34s)
[13:36:00] <stashbot>	 T313116: Special:NewLexemeAlpha doesn’t work on mobile - https://phabricator.wikimedia.org/T313116
[13:36:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:36:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:37:08] <Lucas_WMDE>	 same message again
[13:37:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:37:34] <Lucas_WMDE>	 it doesn’t mention which host it’s for (if any), but the scap still takes the usual amount of time after that
[13:37:43] <Lucas_WMDE>	 so to me it feels like (most?) hosts are still getting their php-fpm restarted
[13:38:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[13:38:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[13:39:19] <MichaelG_WMDE>	 to me it works on multiple different hosts
[13:39:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/WikibaseLexeme/WikibaseLexeme.resources.php: Backport: [[gerrit:815425|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (2/2) (duration: 03m 08s)
[13:39:45] <Lucas_WMDE>	 https://test.m.wikidata.org/wiki/Special:NewLexemeAlpha works for me without mwdebug, at least
[13:39:53] <Lucas_WMDE>	 I guess I won’t worry about it then
[13:40:16] <Lucas_WMDE>	 but I’ll paste the full message for reference:
[13:40:20] <Lucas_WMDE>	 `Job /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.2-fpm 9223372036854775807 called with an empty host list.`
[13:40:33] <RhinosF1>	 Lucas_WMDE: scap was updated yesterday
[13:44:24] <wikibugs>	 (03Merged) 10jenkins-bot: Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE))
[13:45:24] <moritzm>	 !log installing containerd security updates in Kubernetes codfw cluster
[13:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:40] <Lucas_WMDE>	 jnuche or jeena: any idea why wmf.19/extensions/WikibaseLexeme/ is “in the middle of an am session”?
[13:47:55] <Lucas_WMDE>	 (asking you since your names appear in ls -l .git/modules/extensions/WikibaseLexeme)
[13:48:39] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10ayounsi) p:05Triage→03High a:03Cmjohnson
[13:48:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31519 and previous config saved to /var/cache/conftool/dbconfig/20220720-134841-marostegui.json
[13:48:47] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2034.codfw.wmnet with OS bullseye
[13:48:53] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2034.codfw.wmnet with OS bullseye completed: - elastic2034 (**PAS...
[13:49:43] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BBlack) a:03Jdforrester-WMF Hi - the process for the public certs+DN...
[13:52:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:53:02] <icinga-wm>	 PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:53:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:53:33] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10Volans) I was able to fix `elastic2067` via local IPMI. I've added the following sections to wikitech: * https://wikitech.wikimedia.org/wiki/Management_Interfaces#I...
[13:54:10] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@deploy1002 /srv/mediawiki-staging (master $ u=) $ git -C php-1.39.0-wmf.19/extensions/WikibaseLexeme am --skip # T308659 backport already applied
[13:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:14] <stashbot>	 T308659: Validate lemma length in Special:NewLexeme(Alpha) and label/description/aliases length in Special:NewProperty (CVE-2022-34750) - https://phabricator.wikimedia.org/T308659
[13:54:49] <Lucas_WMDE>	 okay, now the submodule update works
[13:54:55] <Lucas_WMDE>	 pulled to mwdebug1001 (cc MichaelG_WMDE)
[13:55:12] * MichaelG_WMDE looks
[13:55:58] <Lucas_WMDE>	 seems to load here, at least (I don’t really want to actually create a lexeme)
[13:56:09] <MichaelG_WMDE>	 looks good to me (not creating a test Lexeme here, because this is production^^)
[13:56:14] <Lucas_WMDE>	 ^^
[13:57:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:57:28] <jnuche>	 Lucas_WMDE: there was a merge conflict when applying security patches during the train deployment of wmf.19
[13:57:39] <jnuche>	 I think that's what you were seeing
[13:57:46] <Lucas_WMDE>	 yes, I think I resolved it now
[13:57:54] <jnuche>	 ok, sorry about that
[13:59:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/WikibaseLexeme/src/MediaWiki/Config/LexemeLanguageCodePropertyIdConfig.php: Backport: [[gerrit:815726|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (1/2) (duration: 02m 56s)
[13:59:53] <stashbot>	 T313116: Special:NewLexemeAlpha doesn’t work on mobile - https://phabricator.wikimedia.org/T313116
[14:00:55] <wikibugs>	 (03CR) 10David Caro: "Sorry for the partial review, things are quite busy lately" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[14:01:12] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) >>! In T313227#8091301, @BBlack wrote: > Hi - the pro...
[14:01:16] <wikibugs>	 (03CR) 10David Caro: "Sorry for the partial review, things are quite busy." [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[14:02:48] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10cmooney) Agreed this is a good idea.  I can see why it may have been "left alone" previously but given we'd had issues best to bite the bullet and do it.  The 40G u...
[14:02:50] <jbond>	 !log disable puppet on A:cp to deploy Gerrit:768766
[14:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10Volans) Updated the comment above as I made the command safer directly in the docs :)
[14:03:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/WikibaseLexeme/WikibaseLexeme.resources.php: Backport: [[gerrit:815726|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (2/2) (duration: 03m 02s)
[14:03:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31520 and previous config saved to /var/cache/conftool/dbconfig/20220720-140346-marostegui.json
[14:04:05] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:48] <wikibugs>	 (03PS1) 10Volans: Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722
[14:06:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722 (owner: 10Volans)
[14:07:54] <wikibugs>	 (03CR) 10Ori: [C: 03+2] role::beta::docker_services: prune docker images [puppet] - 10https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: 10Ori)
[14:09:01] <wikibugs>	 (03PS2) 10Marostegui: Put cloudweb100[34] into service [puppet] - 10https://gerrit.wikimedia.org/r/815378 (https://phabricator.wikimedia.org/T305414) (owner: 10Andrew Bogott)
[14:11:51] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715
[14:13:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto)
[14:13:27] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722 (owner: 10Volans)
[14:17:24] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[14:17:28] <icinga-wm>	 RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31521 and previous config saved to /var/cache/conftool/dbconfig/20220720-141851-marostegui.json
[14:18:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[14:18:57] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:19:07] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[14:19:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond)
[14:19:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31522 and previous config saved to /var/cache/conftool/dbconfig/20220720-141912-marostegui.json
[14:21:21] <wikibugs>	 (03PS1) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723
[14:22:14] <wikibugs>	 (03PS1) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724
[14:22:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31523 and previous config saved to /var/cache/conftool/dbconfig/20220720-142214-marostegui.json
[14:23:38] <wikibugs>	 (03PS2) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723
[14:24:35] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36326/console" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah)
[14:25:14] <wikibugs>	 (03PS1) 10Jbond: Revert "P:varnish::common: Add support for passing wikimedia_domains" [puppet] - 10https://gerrit.wikimedia.org/r/815727
[14:26:08] <moritzm>	 !log installing containerd security updates in Kubernetes codfw masters
[14:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:varnish::common: Add support for passing wikimedia_domains" [puppet] - 10https://gerrit.wikimedia.org/r/815727 (owner: 10Jbond)
[14:26:45] <wikibugs>	 (03PS1) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728
[14:26:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans)
[14:29:46] <icinga-wm>	 PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:34] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715
[14:30:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "This looks good, and pcc agrees." [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah)
[14:36:00] <volans>	 !log uploaded spicerack_3.1.0 to apt.wikimedia.org bullseye-wikimedia
[14:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto)
[14:37:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31524 and previous config saved to /var/cache/conftool/dbconfig/20220720-143719-marostegui.json
[14:37:24] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:42:22] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:44:06] <volans>	 !log installing spicerack 3.1.0 on cumin2002
[14:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:31] <wikibugs>	 (03CR) 10Ahmon Dancy: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[14:50:38] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Avoid additional errors if connection to poolcounter server fails [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy)
[14:50:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Handle socket.timeout the same way as TimeoutError [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814893 (owner: 10Ahmon Dancy)
[14:50:42] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Raise the default connection timeout to 2 seconds [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815747 (https://phabricator.wikimedia.org/T310835)
[14:50:44] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: New version 0.0.3 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815748
[14:50:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: New package version 0.0.3-1 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815749
[14:52:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31527 and previous config saved to /var/cache/conftool/dbconfig/20220720-145224-marostegui.json
[14:52:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Avoid additional errors if connection to poolcounter server fails (031 comment) [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy)
[14:53:25] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah)
[14:53:42] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:24] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah)
[14:56:38] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Raise the default connection timeout to 2 seconds [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815747 (https://phabricator.wikimedia.org/T310835) (owner: 10Giuseppe Lavagetto)
[14:59:02] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[15:03:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:19] <jinxer-wm>	 (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:04:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2036.codfw.wmnet with OS bullseye
[15:04:27] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye
[15:04:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[15:04:40] <rzl>	 three minutes into the shift 😖
[15:04:48] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[15:04:49] <marostegui>	 xDDD
[15:05:11] <XioNoX>	 rzl: https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=All&var-target_site=eqsin&var-role=cr&var-family=All something is up with eqsin or transport to there
[15:05:20] <vgutierrez>	 hmm connectivity issue on eqsin?
[15:05:33] <RhinosF1>	 I can ping 4 and 6 but curl fails
[15:05:41] <rzl>	 thanks -- laptop's still warming up, then I'll depool
[15:05:45] <RhinosF1>	 okay just slow
[15:05:58] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 21.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:06:01] <cdanis>	 there was also a big spike of appserver queries, although it didn't affect saturation and they were very fast queries
[15:06:30] <RhinosF1>	 it responded much faster to a 2nd curl
[15:06:38] <Lucas_WMDE>	 RhinosF1: I made a phab task for that scap message earlier: https://phabricator.wikimedia.org/T313417
[15:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:18] <wikibugs>	 (03PS1) 10Majavah: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815750
[15:07:20] <vgutierrez>	 incoming traffic drop on lvs5001 and lvs5002... https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=46&from=now-3h&to=now
[15:07:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31528 and previous config saved to /var/cache/conftool/dbconfig/20220720-150730-marostegui.json
[15:07:31] <wikibugs>	 (03PS1) 10RLazarus: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751
[15:07:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[15:07:34] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[15:07:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[15:07:50] <XioNoX>	 higher latency through both transports so not related to transport link
[15:07:59] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751 (owner: 10RLazarus)
[15:08:19] <jinxer-wm>	 (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:08:19] <jinxer-wm>	 (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:08:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 78.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:08:37] <RhinosF1>	 Lucas_WMDE: ack
[15:08:45] <wikibugs>	 (03PS2) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728
[15:08:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[15:09:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[15:09:05] <rzl>	 hm, might be recovered
[15:09:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31529 and previous config saved to /var/cache/conftool/dbconfig/20220720-150908-marostegui.json
[15:09:32] <rzl>	 I haven't merged the depool yet, going to hold it but keep it ready
[15:09:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[15:09:52] <rzl>	 ^ that's just delayed
[15:10:07] <logmsgbot>	 !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw
[15:10:09] <RhinosF1>	 rzl: curl is instant compared to very slow when the page happened now
[15:10:27] <cdanis>	 RhinosF1: is your traffic hitting eqsin or somewhere else?
[15:10:59] <RhinosF1>	 cdanis: i just like being nosey and checked when i saw the page because i'm bored and have nothing better to do
[15:11:25] <cdanis>	 rzl: https://i.imgur.com/dx0Ktg0.png
[15:11:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:46] <cdanis>	 rzl: I kind of suspect that there was a spike of something served statically by the appservers but not cached on mostly eqsin, enough to saturate both transits?
[15:12:06] <rzl>	 hm! text but not upload, which makes that a little trickier
[15:12:10] <rzl>	 but still possible
[15:12:11] <XioNoX>	 cdanis: transports you mean?
[15:12:14] <cdanis>	 XioNoX: yes sorry
[15:12:16] <cdanis>	 transports
[15:12:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31530 and previous config saved to /var/cache/conftool/dbconfig/20220720-151216-marostegui.json
[15:12:27] <cdanis>	 rzl: XioNoX: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=appserver&var-instance=All&var-datasource=thanos&from=1658328835477&to=1658329930726&viewPanel=84
[15:12:37] <cdanis>	 appserver cluster was txing 1.6 Gbyte/sec
[15:13:15] <wikibugs>	 (03PS3) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728
[15:13:17] <rzl>	 wild
[15:13:38] <rzl>	 okay, will dig a little and see if I can figure out what that traffic was, but first coffee
[15:13:50] <rzl>	 XioNoX: just to verify, are you okay leaving eqsin pooled unless this comes back?
[15:14:07] <XioNoX>	 rzl: yeah everything is back to normal network wise
[15:14:13] <rzl>	 rad, thanks
[15:14:18] <XioNoX>	 https://librenms.wikimedia.org/device/device=159/tab=port/port=13968/ indeed some spikes on the transport links
[15:15:26] <XioNoX>	 the codfw-eqsin link lost its OSPF adjacency
[15:16:19] <XioNoX>	 very briefly, traffic routed through ulsfo
[15:16:34] <XioNoX>	 Jul 20 15:02:41  cr3-eqsin bfdd[16011]: BFD Session 103.102.166.139 (IFL 90) state Up -> Down LD/RD(171/2015) Up time:1w0d 19:19 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
[15:16:36] <vgutierrez>	 purged shows a 5-minute lag around 15:00 - 15:05
[15:16:39] <vgutierrez>	 (on eqsin)
[15:16:49] <cdanis>	 XioNoX: fallout of saturating that link?
[15:17:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fix db2167:3318', diff saved to https://phabricator.wikimedia.org/P31531 and previous config saved to /var/cache/conftool/dbconfig/20220720-151711-marostegui.json
[15:17:16] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:17:22] <XioNoX>	 cdanis: it's not common but could be yeah
[15:20:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage
[15:23:12] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage
[15:26:01] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 859341 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[15:26:08] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[15:26:09] <logmsgbot>	 !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw
[15:26:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: php_exporter: only export the proper php version [puppet] - 10https://gerrit.wikimedia.org/r/815755
[15:27:05] <icinga-wm>	 RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31532 and previous config saved to /var/cache/conftool/dbconfig/20220720-152721-marostegui.json
[15:28:16] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[15:31:17] <dancy>	 jouncebot nowandnext
[15:31:17] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 28 minute(s)
[15:31:17] <jouncebot>	 In 2 hour(s) and 28 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800)
[15:31:18] <jouncebot>	 In 2 hour(s) and 28 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800)
[15:31:58] <dancy>	 I'm going to run a couple of `scap sync-wikiversions` tests
[15:32:54] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10Cmjohnson) I swapped the optics for both and cleaned fiber.
[15:35:50] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided)
[15:38:17] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Ladsgroup) If I can understand the "somehow" better, I might be able to help.  Did you try changing the email in https://lists.wikimedia.org/accounts/email/?
[15:39:06] <wikibugs>	 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr)
[15:39:19] <wikibugs>	 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr)
[15:39:37] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: testing
[15:39:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:39:47] <dancy>	 Done testing
[15:39:59] <wikibugs>	 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) Racked and installed adapters.  Adjusted racking location to U46
[15:40:06] <wikibugs>	 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) 05Open→03Resolved
[15:41:29] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:42:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31534 and previous config saved to /var/cache/conftool/dbconfig/20220720-154227-marostegui.json
[15:43:39] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I think this change is good as is. Ahmon has a point we should aim at using the global generated known_hosts file under /etc, but I beli" [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[15:45:37] <wikibugs>	 (03CR) 10Ahmon Dancy: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[15:45:51] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[15:46:08] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2036.codfw.wmnet with OS bullseye
[15:46:16] <wikibugs>	 (03PS1) 10JMeybohm: k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661)
[15:46:17] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye completed: - elastic2034 (**PAS...
[15:46:20] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye executed with errors: - elastic...
[15:48:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759
[15:50:02] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10RASharma_WMF) Hi,  Id would mean the WMF email address.  Currently when I try to log in, using my WMF email address, I am asked to verify (which I purposefully haven't)...
[15:50:20] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw
[15:50:29] <jayme>	 \o/
[15:51:39] <wikibugs>	 (03PS1) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729
[15:51:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for the python side" [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[15:52:21] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "+1 I would love gerrit_servers to be renamed ssh_allowed_hosts in a future change (see inline comment for rationale)." [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[15:52:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:53:33] <wikibugs>	 (03PS2) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729
[15:53:51] <wikibugs>	 (03PS1) 10Jbond: C:varnish: improve error messaging for reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/815761
[15:53:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[15:54:45] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759
[15:54:57] <wikibugs>	 (03PS3) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729
[15:55:41] <icinga-wm>	 PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:47] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.11.2" for 557 hosts
[15:57:07] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.11.2" completed for 557 hosts
[15:57:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31535 and previous config saved to /var/cache/conftool/dbconfig/20220720-155732-marostegui.json
[15:57:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[15:57:37] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[15:57:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[15:57:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31536 and previous config saved to /var/cache/conftool/dbconfig/20220720-155752-marostegui.json
[16:01:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31537 and previous config saved to /var/cache/conftool/dbconfig/20220720-160103-marostegui.json
[16:01:37] <wikibugs>	 (03Abandoned) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729 (owner: 10Dduvall)
[16:01:56] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661)
[16:03:53] <wikibugs>	 (03CR) 10JMeybohm: k8s: Adapt retry parameters to reality (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[16:03:59] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[16:05:45] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[16:05:51] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:08:29] <icinga-wm>	 RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31538 and previous config saved to /var/cache/conftool/dbconfig/20220720-161608-marostegui.json
[16:17:26] <wikibugs>	 (03Abandoned) 10RLazarus: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751 (owner: 10RLazarus)
[16:21:55] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[16:23:55] <icinga-wm>	 PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31539 and previous config saved to /var/cache/conftool/dbconfig/20220720-163113-marostegui.json
[16:35:45] <icinga-wm>	 RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[16:40:46] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:41:27] <wikibugs>	 (03PS2) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724
[16:43:16] <wikibugs>	 (03PS3) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724
[16:44:13] <wikibugs>	 (03CR) 10JHathaway: beaker: add a method to hack fixes specific to beaker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond)
[16:44:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans)
[16:46:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31540 and previous config saved to /var/cache/conftool/dbconfig/20220720-164618-marostegui.json
[16:46:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[16:46:24] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[16:46:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[16:46:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31541 and previous config saved to /var/cache/conftool/dbconfig/20220720-164638-marostegui.json
[16:47:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans)
[16:47:27] <rzl>	 jouncebot: nowandnext
[16:47:27] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[16:47:27] <jouncebot>	 In 1 hour(s) and 12 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800)
[16:47:27] <jouncebot>	 In 1 hour(s) and 12 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800)
[16:47:59] <rzl>	 borrowing mwdebug1001 to test an apache change, won't be long
[16:48:54] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759 (owner: 10Giuseppe Lavagetto)
[16:49:20] <rzl>	 !log rzl@cumin2002:~$ sudo cumin A:mw 'disable-puppet 815759'
[16:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:27] <wikibugs>	 (03PS4) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724
[16:49:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31542 and previous config saved to /var/cache/conftool/dbconfig/20220720-164946-marostegui.json
[16:53:02] <rzl>	 correction, borrowing mwdebug1002
[16:59:41] <wikibugs>	 (03PS1) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746)
[17:01:27] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:01:51] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:01:53] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:06] <wikibugs>	 (03PS1) 10RLazarus: wwwportals: Actually clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815770
[17:04:23] <rzl>	 (httpbb alerts are expected)
[17:04:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31543 and previous config saved to /var/cache/conftool/dbconfig/20220720-170451-marostegui.json
[17:05:49] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2048.codfw.wmnet with OS bullseye
[17:05:55] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2048.codfw.wmnet with OS bullseye
[17:07:49] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] wwwportals: Actually clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815770 (owner: 10RLazarus)
[17:12:11] <rzl>	 !log rzl@cumin2002:~$ sudo cumin A:mw 'enable-puppet 815759'
[17:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:23] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:48] <rzl>	 done
[17:19:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31544 and previous config saved to /var/cache/conftool/dbconfig/20220720-171956-marostegui.json
[17:25:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2048.codfw.wmnet with reason: host reimage
[17:28:16] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2048.codfw.wmnet with reason: host reimage
[17:35:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31545 and previous config saved to /var/cache/conftool/dbconfig/20220720-173502-marostegui.json
[17:35:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[17:35:07] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[17:35:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[17:35:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31546 and previous config saved to /var/cache/conftool/dbconfig/20220720-173522-marostegui.json
[17:35:43] <wikibugs>	 (03CR) 10Andrew Bogott: wmcs: don't page for most checks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro)
[17:35:59] <wikibugs>	 (03PS1) 10Ssingh: durum: improve check frontend loading message [puppet] - 10https://gerrit.wikimedia.org/r/815771
[17:38:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31547 and previous config saved to /var/cache/conftool/dbconfig/20220720-173823-marostegui.json
[17:38:30] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2048.codfw.wmnet with OS bullseye
[17:38:36] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2048.codfw.wmnet with OS bullseye executed with errors: - elastic...
[17:38:38] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[17:38:41] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[17:39:39] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] durum: improve check frontend loading message [puppet] - 10https://gerrit.wikimedia.org/r/815771 (owner: 10Ssingh)
[17:45:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF)
[17:47:44] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: fix sync_time for copying fernet keys [puppet] - 10https://gerrit.wikimedia.org/r/815772
[17:48:51] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+1] scap.cfg.erb: Set gerrit_push_user: trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/815329 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy)
[17:50:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Keystone: fix sync_time for copying fernet keys [puppet] - 10https://gerrit.wikimedia.org/r/815772 (owner: 10Andrew Bogott)
[17:50:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[17:51:00] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[17:52:01] <wikibugs>	 (03PS1) 10David Caro: wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773
[17:52:47] <wikibugs>	 (03PS5) 10Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241)
[17:53:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro)
[17:53:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31548 and previous config saved to /var/cache/conftool/dbconfig/20220720-175328-marostegui.json
[17:54:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Once https://gerrit.wikimedia.org/r/c/operations/alerts/+/815773 is in place, I'm happy with this patch!" [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro)
[17:56:37] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:58:20] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro)
[18:00:05] <jouncebot>	 jeena and jnuche: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800).
[18:00:05] <jouncebot>	 jeena and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800).
[18:00:48] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro)
[18:01:39] <wikibugs>	 (03PS4) 10Ryan Kemper: Revert "elastic: increase recovery time" [cookbooks] - 10https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking)
[18:01:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "striker: Open firewall for Docker-managed service" [puppet] - 10https://gerrit.wikimedia.org/r/811274 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[18:03:07] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: increase recovery time" [cookbooks] - 10https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking)
[18:04:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hieradata: cloudweb-dev: route striker to the docker port [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah)
[18:06:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I completely forgot about that series of patch. Congratulations!" [puppet] - 10https://gerrit.wikimedia.org/r/815329 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy)
[18:06:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] "Striker on cloudweb1002 is broken the same way after this patch as before, so... success?" [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah)
[18:08:28] <wikibugs>	 (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074)
[18:08:30] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[18:08:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31549 and previous config saved to /var/cache/conftool/dbconfig/20220720-180834-marostegui.json
[18:09:24] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[18:12:57] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.21  refs T308074
[18:13:02] <stashbot>	 T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074
[18:15:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:15:16] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[18:15:20] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[18:16:05] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.21  refs T308074 (duration: 03m 07s)
[18:16:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:16:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:17:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:17:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2045.codfw.wmnet with OS bullseye
[18:17:30] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye
[18:18:25] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:22:04] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943)
[18:23:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31550 and previous config saved to /var/cache/conftool/dbconfig/20220720-182339-marostegui.json
[18:23:47] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[18:23:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2079.codfw.wmnet with reason: Maintenance
[18:24:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2079.codfw.wmnet with reason: Maintenance
[18:24:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 15 hosts with reason: Maintenance
[18:24:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 15 hosts with reason: Maintenance
[18:24:44] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943)
[18:25:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[18:25:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[18:25:47] <wikibugs>	 (03PS3) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943)
[18:26:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[18:26:53] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper)
[18:27:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[18:27:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31551 and previous config saved to /var/cache/conftool/dbconfig/20220720-182710-marostegui.json
[18:28:25] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::grid: add bash completion to exec-manage [puppet] - 10https://gerrit.wikimedia.org/r/815780
[18:29:54] <wikibugs>	 (03PS2) 10Cwhite: logstash: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175)
[18:30:24] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:35:03] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2045.codfw.wmnet with OS bullseye
[18:35:10] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye executed with errors: - elastic...
[18:36:48] <wikibugs>	 (03PS1) 10Ebernhardson: apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781
[18:36:50] <wikibugs>	 (03PS1) 10Ebernhardson: apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782
[18:36:52] <wikibugs>	 (03PS1) 10Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783
[18:36:54] <wikibugs>	 (03PS1) 10Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784
[18:37:04] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[18:37:08] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[18:37:13] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: add rack info for 3 new hosts [puppet] - 10https://gerrit.wikimedia.org/r/815785 (https://phabricator.wikimedia.org/T300943)
[18:38:47] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: add rack info for 3 new hosts [puppet] - 10https://gerrit.wikimedia.org/r/815785 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper)
[18:49:26] <wikibugs>	 (03Abandoned) 10Ebernhardson: Remove i18n and IS references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814873 (https://phabricator.wikimedia.org/T313248) (owner: 10Ebernhardson)
[18:52:48] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack Nova: Allow duplicate VM names in different projects. [puppet] - 10https://gerrit.wikimedia.org/r/815787 (https://phabricator.wikimedia.org/T305831)
[18:53:45] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF)
[18:59:28] <wikibugs>	 (03PS2) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746)
[19:00:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall)
[19:00:30] <icinga-wm>	 PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-psi-codfw.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:51] <wikibugs>	 (03CR) 10Dduvall: "Jelto, I've attempted a refactor here to: 1) hopefully simplify the approach; 2) properly re-configure existing runners; the prior patch c" [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall)
[19:04:41] <wikibugs>	 (03PS3) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746)
[19:07:17] <taavi>	 jeena: jnuche: hey, can we rollback group1 due to T313432? commons uploading interfaces (Special:Upload and Special:UploadWizard) are completely broken at least for me
[19:07:18] <stashbot>	 T313432: Error: Call to a member function getConfig() on null - https://phabricator.wikimedia.org/T313432
[19:07:47] <jeena>	 okay, I'll roll back
[19:07:54] <taavi>	 thanks
[19:08:05] <taavi>	 I'm looking if I can find any obvious causes for that
[19:08:13] <jeena>	 thanks :)
[19:09:00] <wikibugs>	 (03PS1) 10Ssingh: durum: CSP default-src should be none [puppet] - 10https://gerrit.wikimedia.org/r/815789
[19:09:43] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36332/console" [puppet] - 10https://gerrit.wikimedia.org/r/815789 (owner: 10Ssingh)
[19:12:46] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: CSP default-src should be none [puppet] - 10https://gerrit.wikimedia.org/r/815789 (owner: 10Ssingh)
[19:13:06] <taavi>	 found the cause, let's see if I can fix it easily
[19:13:26] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to [VERSION]"
[19:13:46] <icinga-wm>	 RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:47] <jeena>	 ugh sorry for messing up the message
[19:15:29] <jeena>	 !log that should be revert group1 wikis to 1.39.0-wmf.19
[19:16:19] <mutante>	 you can still edit on the wiki if you want
[19:16:47] <bd808>	 but not on twitter, mastodon, or sal.toolforge.org :)
[19:17:02] <jeena>	 💀
[19:17:08] <taavi>	 (potential) fix is up on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/3D/+/815790/
[19:17:15] * bd808 believes he typos more !log messages than not
[19:19:01] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Danielgblack) If you've still got the perf base of the flamegraph, is a possible to get  ` perf report --no-children --stdio -i inp...
[19:19:56] <wikibugs>	 (03PS2) 10Ebernhardson: apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781 (https://phabricator.wikimedia.org/T313434)
[19:19:58] <wikibugs>	 (03PS2) 10Ebernhardson: apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782 (https://phabricator.wikimedia.org/T313434)
[19:20:00] <wikibugs>	 (03PS2) 10Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434)
[19:20:02] <wikibugs>	 (03PS2) 10Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434)
[19:20:19] <wikibugs>	 (03PS1) 10Jeena Huneidi: Revert "group1 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074)
[19:20:21] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:20:26] <wikibugs>	 (03PS1) 10Ladsgroup: PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432)
[19:20:32] <taavi>	 thanks Amir1 
[19:20:39] <taavi>	 who wants to deploy a backport?
[19:20:46] <Amir1>	 thank you for the patch
[19:20:53] <Amir1>	 where is jeena
[19:21:01] <Amir1>	 Can I?
[19:21:15] <jeena>	 I just need to merge this config patch
[19:22:19] <wikibugs>	 (03PS2) 10Jeena Huneidi: Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074)
[19:22:34] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:23:01] <jeena>	 Amir1: all good
[19:23:16] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432) (owner: 10Ladsgroup)
[19:23:30] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:24:54] <wikibugs>	 (03CR) 10JHathaway: "Running:" [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond)
[19:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432) (owner: 10Ladsgroup)
[19:25:30] <Amir1>	 that was fast
[19:26:40] <Amir1>	 taavi: now you can't test it I guess?
[19:26:54] <Amir1>	 jeena: now that group1 is back on wmf.19 we can't see if it's fixed
[19:27:05] <jeena>	 I can roll forward if you like
[19:27:06] <taavi>	 I can manually hack commons to .21 on a mwdebug box
[19:27:20] <Amir1>	 taavi: nah, let's push and move forward
[19:27:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31552 and previous config saved to /var/cache/conftool/dbconfig/20220720-192724-marostegui.json
[19:27:30] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[19:27:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:28:01] <jeena>	 should I go ahead?
[19:28:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:28:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:29:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:30:21] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] "I got confused by the title and changed the commit message... So the original message Revert "group1 wikis to 1.39.0-wmf.21" was correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:30:59] <Amir1>	 jeena: the sync will finish in a sec
[19:32:24] <Amir1>	 I have some important stuff (not risky, important) being shipped in wmf.21, I really want to see it done
[19:32:36] <Amir1>	 The last pieces of templatelinks normalization 
[19:32:45] <jeena>	 :)
[19:33:57] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/3D/src/PatentFormField.php: Backport: [[gerrit:815733|PatentFormField: pass on $this->mParent to HTMLRadioField constructor (T313432)]] (duration: 03m 08s)
[19:34:01] <stashbot>	 T313432: Error: Call to a member function getConfig() on null - https://phabricator.wikimedia.org/T313432
[19:34:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:35:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:35:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:36:20] <Amir1>	 jeena: done, feel free to move ahead
[19:36:38] <jeena>	 thanks Amir1 &  taavi 
[19:36:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:36:58] <wikibugs>	 (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074)
[19:37:04] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:37:08] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn)
[19:38:33] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi)
[19:40:41] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) Notes based on IRC discussion on #wikimedia-traffic:  * We only want to apply query sorting to text requests for now, because we ca...
[19:41:28] <taavi>	 fix confirmed working
[19:41:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:42:07] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.21  refs T308074
[19:42:10] <stashbot>	 T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074
[19:42:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31553 and previous config saved to /var/cache/conftool/dbconfig/20220720-194229-marostegui.json
[19:42:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:42:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:43:28] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn) also see T312194#8092388  We now have working checks.  Here you can see how it is working:  https://grafana-rw.wikimedia.org/explore...
[19:43:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:45:00] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.21  refs T308074 (duration: 02m 53s)
[19:45:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[19:46:07] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Add suppport for multiple backup/passive nodes in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[19:48:29] <wikibugs>	 (03CR) 10Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[19:48:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:49:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:49:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:50:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:51:33] <wikibugs>	 (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikiquote vhost [puppet] - 10https://gerrit.wikimedia.org/r/815794 (https://phabricator.wikimedia.org/T273179)
[19:53:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[19:53:20] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[19:54:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2032.codfw.wmnet with OS bullseye
[19:55:02] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2032.codfw.wmnet with OS bullseye
[19:57:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31554 and previous config saved to /var/cache/conftool/dbconfig/20220720-195734-marostegui.json
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T2000).
[20:00:05] <jouncebot>	 MatmaRex and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:25] <cjming>	 i can deploy o/ (since i'm covering for Jon's patch)
[20:00:41] <urbanecm>	 o/
[20:00:44] <urbanecm>	 I'm also here if needed
[20:00:55] <MatmaRex>	 hi
[20:01:08] <cjming>	 hi MatmaRex - let's do it
[20:01:14] <wikibugs>	 (03PS2) 10Clare Ming: Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński)
[20:03:24] <cjming>	 thanks urbanecm! hopefully you won't be needed
[20:04:05] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński)
[20:04:49] <wikibugs>	 (03Merged) 10jenkins-bot: Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński)
[20:06:01] <cjming>	 MatmaRex: ur patch is up on mwdebug1002 - is it testable?
[20:06:10] <MatmaRex>	 yeah, looking
[20:06:40] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson)
[20:06:49] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250)
[20:07:29] <Jdlrobson>	 @cjming I'm around after all if you need some help with testing
[20:07:53] <cjming>	 cool - thanks Jdlrobson
[20:07:56] <MatmaRex>	 cjming: looks good
[20:08:01] <cjming>	 fabu - syncing
[20:08:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2032.codfw.wmnet with reason: host reimage
[20:08:52] <wikibugs>	 (03PS6) 10Clare Ming: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson)
[20:08:55] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Expect logstash will be restarted by puppet when this gets deployed." [puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson)
[20:10:25] <wikibugs>	 (03CR) 10Jbond: beaker: add initial beaker files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond)
[20:10:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:11:21] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815359|Enable DiscussionTools visualenhancements as beta feature on partner wikis (T312670)]] (duration: 03m 10s)
[20:11:25] <stashbot>	 T312670: [Config Change] Enable Topic Containers as beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T312670
[20:11:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:45] <cjming>	 MatmaRex: can you verify on prod? there were some issues syncing yesterday
[20:11:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2032.codfw.wmnet with reason: host reimage
[20:12:00] <MatmaRex>	 ok
[20:12:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:12:26] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "Patch looks good, I'm curious why it took so long to run on my box, could be an issue with podman." [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond)
[20:12:32] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:12:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31555 and previous config saved to /var/cache/conftool/dbconfig/20220720-201240-marostegui.json
[20:12:44] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[20:14:05] <MatmaRex>	 cjming: is the change deployed now? i don't see the expected effect when not using mwdebug
[20:14:30] <cjming>	 hmm - should be
[20:14:48] <MatmaRex>	 if you could confirm
[20:14:54] <cjming>	 dancy: if you're around, do i still need to double sync?
[20:14:54] <MatmaRex>	 you need to enable "Discussion tools" at https://ar.wikipedia.org/wiki/خاص:تفضيلات?uselang=en#mw-prefsection-betafeatures
[20:15:03] <MatmaRex>	 and then visit https://ar.wikipedia.org/wiki/نقاش:الصفحة_الرئيسية
[20:15:18] <dancy>	 cjming: That problem should be fixed now.
[20:15:20] <MatmaRex>	 each heading should have some metadata added underneath it, in grey text
[20:16:09] <MatmaRex>	 i am only seeing the change inconsistently whenever i reload the page
[20:16:19] <MatmaRex>	 so this seems related to issues we've had, uhh, a couple weeks ago?
[20:16:26] <dancy>	 That does seem to imply the same type of syncing problem.
[20:16:28] <MatmaRex>	 where changes didn't take effect on some servers
[20:16:59] <cjming>	 dancy: should i sync again just to be sure?
[20:17:15] <dancy>	 lemme see if I can dig up some evidence first.
[20:17:44] <cjming>	 great - thanks
[20:20:40] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:20:44] <cjming>	 MatmaRex: i don't really know what i'm looking at - i'm looking for diffs but not sure what is expected on the page you linked
[20:21:22] <MatmaRex>	 one sec
[20:22:22] <MatmaRex>	 current (bad): https://phabricator.wikimedia.org/F35326992 expected (good): https://phabricator.wikimedia.org/F35326993
[20:22:26] <cjming>	 actually now i think i see it 
[20:22:33] <MatmaRex>	 note the different font and the text line below heading
[20:23:07] <MatmaRex>	 the same effect should appear on any talk page (this link is the talk of the main page)
[20:23:49] <dancy>	 Clare please resync
[20:23:54] <cjming>	 alrighty
[20:27:15] <dancy>	 I assume there were no interesting messages during the first sync 
[20:27:30] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815359|Enable DiscussionTools visualenhancements as beta feature on partner wikis (T312670)]] (duration: 03m 26s)
[20:27:34] <stashbot>	 T312670: [Config Change] Enable Topic Containers as beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T312670
[20:28:53] <cjming>	 dancy: not that i recall - and just now i resync'd but i'm still not seeing the expected change in one of my tabs
[20:29:22] <cjming>	 fwiw resync seemed suspiciously fast
[20:29:34] <dancy>	 How long?
[20:29:39] <dancy>	 (for the php-fpm-restart phase)
[20:30:27] <MatmaRex>	 i'm seeing the expected result now, over several refreshes
[20:30:27] <dancy>	 It should be around 2 minutes
[20:30:29] <cjming>	 it says 2m 42s
[20:30:40] <dancy>	 ok that's a normal duration
[20:30:43] <cjming>	 MatmaRex: great!
[20:30:47] <cjming>	 maybe we're good
[20:30:49] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:31:04] <dancy>	 So we still have the same problem. 
[20:31:08] <dancy>	 I'll reopen the ticket
[20:31:09] <cjming>	 bummer
[20:31:52] <cjming>	 ok - I guess i'll move on then
[20:31:59] <cjming>	 and resync if needed
[20:32:29] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson)
[20:32:57] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2032.codfw.wmnet with OS bullseye
[20:33:02] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2032.codfw.wmnet with OS bullseye completed: - elastic2032 (**WAR...
[20:33:15] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson)
[20:33:54] <MatmaRex>	 thanks
[20:34:31] <MatmaRex>	 and thanks for double-checking cjming, i also assumed that this would be fixed
[20:34:35] <cjming>	 np!
[20:34:47] <dancy>	 It was supposed to be!
[20:36:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36334/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:37:36] <mutante>	 firewall change on gerrit... incoming...
[20:37:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:38:06] <mutante>	 ferm reloaded. gerrit still up. go on :)
[20:38:35] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814906|Deploy the new grid layout to group 1 (T312241)]] (duration: 03m 14s)
[20:38:38] <stashbot>	 T312241: Deploy the new grid layout - https://phabricator.wikimedia.org/T312241
[20:38:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:38:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:39:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10wiki_willy) a:03Jclark-ctr
[20:39:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:40:04] <Jdlrobson>	 cjming: I see grid on debug1001 on group 1 wikis so I think that's good to sync
[20:40:17] <icinga-wm>	 PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:41:19] <cjming>	 Jdlrobson: sounds good - i'm actually resyncing bec i didn't see grid on itwiki - can you verify on prod?
[20:41:25] <wikibugs>	 (03CR) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:41:31] <wikibugs>	 (03PS3) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250)
[20:41:40] <Jdlrobson>	 I was looking at Hebrew
[20:41:48] <Jdlrobson>	 I see it on Italian too though
[20:42:01] <cjming>	 oh good
[20:42:36] <wikibugs>	 (03CR) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:42:59] <cjming>	 and i see it now so i think we're good
[20:43:00] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814906|Deploy the new grid layout to group 1 (T312241)]] (duration: 03m 16s)
[20:45:19] <cjming>	 shutting it down early -- if someone needs something in the next few, just give me a poke
[20:45:27] <cjming>	 !log end of UTC late backport window
[20:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:23] <wikibugs>	 (CR) Cwhite: [C: +2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1002/36331/" [puppet] - https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[20:53:55] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:56:23] <icinga-wm>	 RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:58:07] <wikibugs>	 (CR) Dzahn: "compiling this shows this as noop but that's because only the "homedir" is a puppet resource and it has "recurse => 'remote'"" [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[21:13:51] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:14:39] <wikibugs>	 (CR) Dzahn: [C: +2] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[21:19:04] <dancy>	 cjming: Are you still around?  I'd like to look at the transcripts of the deployments you did today to see if I can draw any conclusions.
[21:19:37] <cjming>	 dancy: yup -- i'll see if i can dig them up
[21:19:39] <wikibugs>	 (CR) Dzahn: [C: -1] gerrit: add hiera data for a second replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[21:21:32] <wikibugs>	 (PS1) Cwhite: logstash: add missing closing curly brace [puppet] - https://gerrit.wikimedia.org/r/815799 (https://phabricator.wikimedia.org/T305175)
[21:22:04] <wikibugs>	 (CR) Cwhite: [V: +2 C: +2] logstash: add missing closing curly brace [puppet] - https://gerrit.wikimedia.org/r/815799 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[21:24:46] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) It looks like the maximum rate at which swift-object-expirer will issue deletes is configurable via [[ https://github.com/op...
[21:24:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:29:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:33:28] <wikibugs>	 (CR) Dzahn: [C: +1] "yea, so I don't really know much here (how to test it, when the previous check was added) but let me say I have no concerns if you just do" [puppet] - https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:34:48] <wikibugs>	 (CR) Dzahn: [C: +1] phabricator: switch to prometheus-only network probes/checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:37:52] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:39:44] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (nskaggs)
[21:41:44] <wikibugs>	 (CR) Cwhite: [C: +1] apifeatureusage: Write using the _doc mapping type [puppet] - https://gerrit.wikimedia.org/r/815781 (https://phabricator.wikimedia.org/T313434) (owner: Ebernhardson)
[21:41:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (nskaggs) >>! In T313382#8090176, @Marostegui wrote: > - dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should...
[21:42:23] <wikibugs>	 (CR) Dzahn: prometheus::blackbox::http: add/edit parameter comments (3 comments) [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
[21:43:59] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (JJMC89)
[21:45:21] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[21:45:27] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[21:46:36] <wikibugs>	 (CR) Dzahn: prometheus::blackbox::http: add/edit parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
[21:52:55] <wikibugs>	 (CR) Cwhite: "Following up from Keith's comment, one possible solution to the orphaned configs." [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[21:53:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[21:53:40] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[21:54:54] <wikibugs>	 (PS2) Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - https://gerrit.wikimedia.org/r/807176
[21:57:46] <wikibugs>	 (CR) Dzahn: "a bit of rebasing hell due to other changes but fixing it" [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
[22:12:39] <wikibugs>	 (PS3) Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - https://gerrit.wikimedia.org/r/807176
[22:16:08] <wikibugs>	 (PS4) Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - https://gerrit.wikimedia.org/r/807176
[22:18:46] <wikibugs>	 (CR) BryanDavis: [V: +1] hieradata: cloudweb-dev: route striker to the docker port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: Majavah)
[22:23:24] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Ladsgroup) Sorry it took me a bit to get it done:  - https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/nochildern.superbu...
[22:24:33] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:26:21] <icinga-wm>	 PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:54] <wikibugs>	 SRE, Znuny, serviceops, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Dzahn) here is a screenshot that shows how to get this on https://grafana-rw.wikimedia.org {F35327060}
[22:36:25] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:51:59] <icinga-wm>	 RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:03:23] <wikibugs>	 SRE, ops-codfw, DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (RKemper) Open→Resolved >>! In T313369#8091365, @Volans wrote: > Updated the comment above as I made the command safer directly in the docs :)  Thanks! I fol...
[23:07:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2067.codfw.wmnet with OS bullseye
[23:07:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2068.codfw.wmnet with OS bullseye
[23:10:09] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2070.codfw.wmnet with OS bullseye
[23:10:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2071.codfw.wmnet with OS bullseye
[23:10:11] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2072.codfw.wmnet with OS bullseye
[23:11:55] <ryankemper>	 !log T300943 Fixed IPMI passwords for elastic `20[67,68,70,71,72]`, reimaging them to bullseye (these hosts are not in service, thus the batch operation)
[23:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:00] <stashbot>	 T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943
[23:22:05] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2068.codfw.wmnet with reason: host reimage
[23:22:12] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2067.codfw.wmnet with reason: host reimage
[23:24:13] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2070.codfw.wmnet with reason: host reimage
[23:24:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2072.codfw.wmnet with reason: host reimage
[23:24:29] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2071.codfw.wmnet with reason: host reimage
[23:24:49] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2068.codfw.wmnet with reason: host reimage
[23:28:21] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2071.codfw.wmnet with reason: host reimage
[23:29:44] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic2067.codfw.wmnet with reason: host reimage
[23:29:52] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2070.codfw.wmnet with reason: host reimage
[23:32:14] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2072.codfw.wmnet with reason: host reimage
[23:34:19] <wikibugs>	 (PS5) Jdlrobson: Deploy grid to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241)
[23:37:04] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:38:48] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2068.codfw.wmnet with OS bullseye
[23:42:33] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2071.codfw.wmnet with OS bullseye
[23:43:55] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2070.codfw.wmnet with OS bullseye
[23:44:13] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2067.codfw.wmnet with OS bullseye
[23:46:40] <wikibugs>	 SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (tstarling) I'm working on T279664. Active/active multi-DC mode for MediaWiki is coming very soon. About a month ago I did a quick review of multi-DC support in the...
[23:47:20] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2072.codfw.wmnet with OS bullseye
[23:49:58] <wikibugs>	 (PS8) Fomafix: Add language codes sr-cyrl and sr-latn next to sr-ec and sr-el [mediawiki-config] - https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[23:52:00] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:55:51] <wikibugs>	 (PS14) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[23:57:27] <wikibugs>	 SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (tstarling) T201858 contains a generous dose of clue. Gilles said "I suspect that making thumbnail traffic active/active might actually require less effort than the...