[00:18:46] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:18:56] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:54] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939279
[00:38:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939279 (owner: TrainBranchBot)
[00:52:59] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939279 (owner: TrainBranchBot)
[01:08:33] <wikibugs>	 (CR) BryanDavis: [C: -1] "I think that Taavi's fix at Ieb941d4b159a8dd5dfc329cf678af97d5ec85bc0 has eliminated the need for this complexity. He tested things agains" [software/bitu] - https://gerrit.wikimedia.org/r/935376 (owner: Slyngshede)
[01:09:02] <wikibugs>	 (CR) Tim Starling: [C: +2] Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099) (owner: Jforrester)
[01:10:12] <wikibugs>	 (CR) Tim Starling: [C: +2] Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099) (owner: Jforrester)
[01:13:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:24:37] <wikibugs>	 (Merged) jenkins-bot: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099) (owner: Jforrester)
[01:25:35] <wikibugs>	 (Merged) jenkins-bot: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099) (owner: Jforrester)
[01:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:35:33] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.17/includes/diff/DifferenceEngine.php: fix prod error T342099, T341961 (duration: 09m 20s)
[01:35:39] <stashbot>	 T341961: UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T341961
[01:35:39] <stashbot>	 T342099: PHP Warning: Illegal offset type in LanguageFactory::getLanguage(StubUserLang); subsequently UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T342099
[01:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:39:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:46:20] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.18/includes/diff/DifferenceEngine.php: fix prod error T342099, T341961 (duration: 08m 32s)
[01:46:25] <stashbot>	 T341961: UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T341961
[01:46:25] <stashbot>	 T342099: PHP Warning: Illegal offset type in LanguageFactory::getLanguage(StubUserLang); subsequently UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T342099
[01:52:49] <wikibugs>	 (CR) Milimetric: [C: +1] Create puppet scripting for sqooping Wikifunctions tables [puppet] - https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: David Martin)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:00] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:12] <wikibugs>	 (CR) Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: Krinkle)
[04:32:31] <wikibugs>	 SRE, Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (Joe) Open→Resolved Hi @TheDJ, what you're seeing there is a big influx of 429 from our systems rate-limiting some very aggressive api user from a public cloud.  To put this in perspective, we...
[04:33:33] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=parse1002.*
[04:37:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (Joe) Resolved→Open The server went down twice in one day yesterday, see T342298. So you can sadly uncross your fingers, @akosiaris :(
[04:38:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (Joe) The host is again set to inactive, and still not powercycled so that any further debugging can be performed, if needed.
[04:46:41] <wikibugs>	 (CR) Kaleem Bhatti: [C: +1] "anyone please merge this I don't know else I can" [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[05:35:40] <icinga-wm>	 PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:46:15] <wikibugs>	 (PS2) Sohom Datta: Enable EditInSequence in pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786)
[05:50:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42597/console" [puppet] - https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[05:50:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] services_proxy: add mw-api-int-async-ro [puppet] - https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[05:55:35] <wikibugs>	 (PS3) Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252)
[05:56:29] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0600).
[06:29:24] <icinga-wm>	 RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:37:16] <elukey>	 !log start kafka main eqiad maintenance (partitions rebalancing) - T341558
[06:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:20] <stashbot>	 T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558
[06:46:35] <wikibugs>	 (CR) Elukey: [C: +1] "Sorry didn't think about d-prep! We could also force TLS in there, IIRC we have PKI available, but this change is good to avoid puppet bei" [puppet] - https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: Eevans)
[06:51:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[06:52:20] <elukey>	 the under replicated partitions is due to the rebalance work
[06:56:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[06:57:15] <wikibugs>	 (CR) JMeybohm: [C: +1] sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0700).
[07:00:05] <jouncebot>	 _joe_ and Sohom_Datta: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] <apergos>	 morning! we have three patches scheduled for deployment, no trainees today though two people are waiting to be scheduled. _joe_ it looks like the two patches you have on the calendar are intended to be deployed in that order, correct?
[07:00:33] <Sohom_Datta>	 o/
[07:00:51] <apergos>	 Sohom_Datta: your patch will go third. there should be plenty of time however.
[07:00:52] <_joe_>	 apergos: please go on with Sohom_Datta first :) and yes
[07:01:05] <apergos>	 oh. well never mind, Sohom_Datta your patch will go first :-D
[07:01:08] <_joe_>	 apergos: oh ok, I am usually ok being left last :)
[07:01:12] <Sohom_Datta>	 Yeah sure
[07:01:15] <Sohom_Datta>	 Ah okay :)
[07:01:20] <_joe_>	 because in case I can stretch the window :)
[07:01:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) (owner: Sohom Datta)
[07:01:44] <apergos>	 proceeding
[07:02:19] <wikibugs>	 (Merged) jenkins-bot: Enable EditInSequence in pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) (owner: Sohom Datta)
[07:02:23] <apergos>	 there is now window scheduled after this one in the next hour, so we can go over time if needed.
[07:02:55] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]]
[07:03:30] <apergos>	 *is no
[07:04:32] <logmsgbot>	 !log ariel@deploy1002 ariel and soda: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:04:59] <apergos>	 Sohom_Datta: please test your change on mwdebug1002
[07:05:13] <Sohom_Datta>	 on it :)
[07:06:04] <wikibugs>	 (CR) JMeybohm: [C: -1] modules: Add a new networkpolicy for base modules (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[07:06:49] <Sohom_Datta>	 Tested, looks good to me
[07:07:19] <apergos>	 continuing
[07:11:50] <wikibugs>	 (PS2) Giuseppe Lavagetto: mw-api-int: increase namespace limits [deployment-charts] - https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252)
[07:11:52] <wikibugs>	 (PS3) Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252)
[07:11:54] <wikibugs>	 (PS4) Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252)
[07:12:48] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]] (duration: 09m 52s)
[07:12:52] <apergos>	 Sohom_Datta:  please test your change in production.  
[07:13:43] <Sohom_Datta>	 Looks good, Thanks a lot for deploying :)
[07:13:55] <apergos>	 sure thing!
[07:14:17] <apergos>	 _joe_:  shall I proceed with the first of your patches?
[07:14:30] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:14:36] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:47] <_joe_>	 apergos: yes, once it's on mwdebug, I'll need a longer testing window for this one, sorry
[07:14:54] <_joe_>	 if you prefer I can do it myself
[07:14:57] <apergos>	 that's fine, take your time
[07:14:58] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:15:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:15:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:15:32] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:48] <apergos>	 I should have asked if you preferred to self-deploy. too late now, heh!
[07:15:51] <wikibugs>	 (Merged) jenkins-bot: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:16:16] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:16:19] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]]
[07:16:23] <stashbot>	 T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859
[07:17:38] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:51] <logmsgbot>	 !log ariel@deploy1002 oblivian and ariel: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:17:56] <apergos>	 _joe_: please test on mwdebug1002 and let me know when testing is complete.
[07:18:55] <wikibugs>	 (CR) JMeybohm: [C: +2] wikifunctions: Enable mesh and ingress [deployment-charts] - https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:18:58] <wikibugs>	 (CR) JMeybohm: [C: +2] wikifunctions: Update orchestrator and evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:19:00] <wikibugs>	 (CR) JMeybohm: [C: +2] CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:20:07] <_joe_>	 apergos: please proceed
[07:20:20] <apergos>	 continuing
[07:21:26] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:22:20] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:04] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:36] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:24:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:26] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:25:09] <_joe_>	 apergos: you can move to the next patch immediately after. that's the one that might need a revert
[07:25:33] <apergos>	 no testing of the first patch in production, you mean? _joe_
[07:25:49] <_joe_>	 apergos: it was already done, I just scap pull the code to mwmaint :D
[07:25:55] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]] (duration: 09m 35s)
[07:25:57] <apergos>	 tsk tsk!
[07:25:59] <stashbot>	 T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859
[07:25:59] <_joe_>	 which is where noc.w.org works
[07:26:02] <wikibugs>	 (Merged) jenkins-bot: CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:26:03] <apergos>	 all right, moving to the next patch
[07:26:05] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Update orchestrator and evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:26:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:27:07] <wikibugs>	 (Merged) jenkins-bot: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:27:13] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Enable mesh and ingress [deployment-charts] - https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:27:29] <wikibugs>	 (CR) JMeybohm: [C: +1] Remove the openjdk images based on stretch [docker-images/production-images] - https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:27:33] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]]
[07:29:05] <logmsgbot>	 !log ariel@deploy1002 oblivian and ariel: Backport for [[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:29:09] <apergos>	 _joe_ the moment of doom has arrived, please test your changes on mwdebug1002 (if possible) 
[07:29:22] <_joe_>	 apergos: you said mwmaint1002, right?
[07:29:38] <apergos>	 I did not; you'll have to pull it there to do so
[07:29:48] <_joe_>	 yeah :)
[07:29:55] <apergos>	 I'll wait :-P
[07:29:58] <_joe_>	 I was just being a smartass :P
[07:30:05] <apergos>	 nothing new there :-P
[07:30:38] <apergos>	 (welcome to the ariel and _joe_ show, a special edition of the backport-and-training window on this fine UTC morning)
[07:31:02] <jayme>	 🍿
[07:31:13] <_joe_>	 apergos: go on please :)
[07:31:24] <apergos>	 continuing
[07:31:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] noc: stop using script to populate database data URIs [puppet] - https://gerrit.wikimedia.org/r/938818 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:36:47] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]] (duration: 09m 14s)
[07:36:51] <apergos>	 _joe_:  your patch is now live in production, please do any additional testing that is needed. 
[07:36:51] <stashbot>	 T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859
[07:37:04] <_joe_>	 apergos: thanks, all done :)
[07:37:09] <apergos>	 ok!
[07:37:27] <apergos>	 everyone here gets their remaining 23 minutes back, the window is concluded
[07:37:45] <apergos>	 !log UTC morning backport and config training window complete
[07:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:47] <_joe_>	 if you can see https://noc.wikimedia.org/db.php my change was successful :)
[07:38:11] <apergos>	 you already know my firefox is acting up, sure you want to risk that test? :-P
[07:38:19] <apergos>	  Notice:  Undefined variable: dbctlJsonByDC in /srv/mediawiki/docroot/noc/db.php on line 136
[07:38:23] <apergos>	  Warning:  Invalid argument supplied for foreach() in /srv/mediawiki/docroot/noc/db.php on line 136
[07:38:24] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:37] <apergos>	 Notice:  Undefined variable: dbConfigEtcdJsonFilename in /srv/mediawiki/docroot/noc/db.php on line 144
[07:38:37] <apergos>	 and on 
[07:38:46] <apergos>	 these seem not good, _joe_
[07:38:58] <_joe_>	 no it'
[07:39:11] <_joe_>	 oh 
[07:39:18] <_joe_>	 yeah I must've missed that
[07:39:21] <_joe_>	 the page works though :D
[07:39:25] <apergos>	 lol
[07:39:40] <apergos>	 the window can go for another 20 inutes
[07:39:43] <apergos>	 *minutes
[07:39:52] <apergos>	 shall I re-open? what would you prefer?
[07:39:52] <_joe_>	 yeah I'll fix it and merge as soon as I'm done
[07:39:58] <apergos>	 ok
[07:40:05] <_joe_>	 don't worry, I'll use my root privilege there
[07:40:20] <_joe_>	 where did you see those log messages?
[07:40:22] <apergos>	 !log  UTC morning backport and config training window  reopened for fix to the last noc patch
[07:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:32] <apergos>	 i see them right on the page
[07:40:45] <apergos>	 I went to the link you so kindly provided and they were at the top
[07:40:53] <_joe_>	 sigh they didn't appear on my test
[07:41:10] <apergos>	 Notice:  Undefined variable: dbConfigEtcdJsonFilename in /srv/mediawiki/docroot/noc/db.php on line 145     one more message
[07:41:13] <apergos>	 so that's all 4
[07:41:39] <wikibugs>	 (PS1) JMeybohm: function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314)
[07:42:03] <apergos>	 when things are fixed up again please let me know and I'll close the window again
[07:42:21] <apergos>	 and if you need anything please ping, I'll be watching here of course
[07:45:01] <wikibugs>	 (CR) JMeybohm: [C: +2] function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:45:57] <wikibugs>	 (Merged) jenkins-bot: function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[07:46:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: noc: fix other references to old files. [mediawiki-config] - https://gerrit.wikimedia.org/r/940089
[07:46:19] <_joe_>	 apergos: ^^
[07:46:35] <apergos>	 self-deploy right? (I'm happy to do it though)
[07:47:08] <apergos>	 _joe_: 
[07:47:17] <_joe_>	 apergos: sure
[07:47:32] <apergos>	 ok! please add the patch to the deployment calendar too for the record
[07:47:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] noc: fix other references to old files. [mediawiki-config] - https://gerrit.wikimedia.org/r/940089 (owner: Giuseppe Lavagetto)
[07:48:18] <_joe_>	 yeah
[07:48:52] <wikibugs>	 (Merged) jenkins-bot: noc: fix other references to old files. [mediawiki-config] - https://gerrit.wikimedia.org/r/940089 (owner: Giuseppe Lavagetto)
[07:50:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:41] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[07:50:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[07:51:24] <_joe_>	 apergos: is the thing fixed for you too?
[07:51:35] <apergos>	 looking
[07:52:05] <apergos>	 the errors on the page are gone
[07:52:33] <_joe_>	 ack
[07:53:03] <_joe_>	 I am not going to sync-file this, as there is no reason to cause another global restart of php-fpm 
[07:53:16] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:50] <apergos>	 I wouldn't worry about it
[07:54:01] <apergos>	 the window is for up to 6 patches each with their own php fpm restart :-P
[07:56:55] <apergos>	 !log UTC morning backport and config training window really complete
[07:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi)
[07:57:24] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:15] <wikibugs>	 (03PS4) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527)
[08:02:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:02:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release wikifunctions/main-evaluator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:02:41] <godog>	 checking
[08:03:36] <godog>	 mmhh the fpm workers for parsoid have been creeping up to the limit, from the dashboard link
[08:03:56] <godog>	 I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-2d&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus%2Fops&viewPanel=64&refresh=1m
[08:04:20] <godog>	 I'm guessing this is the issue with parsoid "slow" since last sat, what do you think _joe_  ?
[08:04:44] <godog>	 T342085 that is
[08:04:45] <stashbot>	 T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085
[08:07:24] <godog>	 !incidents
[08:07:24] <sirenbot>	 You're not allowed to perform this action.
[08:07:32] <godog>	 sigh I forgot to ack
[08:07:46] * Emperor here
[08:08:03] <_joe_>	 godog: probably, yes
[08:08:06] <Emperor>	 ah, I see
[08:08:28] <_joe_>	 godog: we're also down another server
[08:08:47] <godog>	 ah that would contribute for sure, thank you
[08:09:00] <godog>	 but yeah I don't think we're in immediate danger
[08:09:36] <godog>	 even though parsoid requests are going up
[08:10:11] <wikibugs>	 (03PS1) 10JMeybohm: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033)
[08:10:12] <godog>	 hah that seems the daily cycle
[08:10:19] <_joe_>	 yes
[08:10:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[08:11:13] <godog>	 ok yeah we've been hovering around the paging limit for days now, I'm looking at 
[08:11:16] <godog>	 https://prometheus-eqiad.wikimedia.org/ops/classic/graph?g0.range_input=1w&g0.expr=sum%20by(cluster%2C%20service)%20(phpfpm_statustext_processes%7Bcluster%3D~%22(api_appserver%7Cappserver%7Cparsoid)%22%2Cstate%3D%22idle%22%7D)%20%2F%20sum%20by(cluster%2C%20service)%20(phpfpm_statustext_processes%7Bcluster%3D~%22(api_appserver%7Cappserver%7Cparsoid)%22%7D)%20&g0.tab=0
[08:11:22] <godog>	 page is at <= 0.3
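Editor's note: the graph link above carries a URL-encoded Prometheus expression, and the paging condition is stated as "<= 0.3". The sketch below decodes that expression and shows the same idle/total ratio check on plain numbers. The function names and the sample worker counts are hypothetical and for illustration only; the real alert rule may differ in details (for example evaluation window or grouping) that are not shown in this log.

```python
from urllib.parse import unquote

# The g0.expr parameter copied verbatim from the graph URL in the log.
ENCODED_EXPR = (
    "sum%20by(cluster%2C%20service)%20(phpfpm_statustext_processes"
    "%7Bcluster%3D~%22(api_appserver%7Cappserver%7Cparsoid)%22%2C"
    "state%3D%22idle%22%7D)%20%2F%20sum%20by(cluster%2C%20service)%20"
    "(phpfpm_statustext_processes%7Bcluster%3D~%22"
    "(api_appserver%7Cappserver%7Cparsoid)%22%7D)%20"
)

# Threshold quoted in the log: "page is at <= 0.3".
PAGE_THRESHOLD = 0.3


def decoded_expr() -> str:
    """Return the human-readable PromQL expression from the URL."""
    return unquote(ENCODED_EXPR).strip()


def idle_fraction(idle_workers: int, total_workers: int) -> float:
    """Fraction of php-fpm workers that are idle (idle / all states)."""
    if total_workers <= 0:
        raise ValueError("total_workers must be positive")
    return idle_workers / total_workers


def would_page(idle_workers: int, total_workers: int,
               threshold: float = PAGE_THRESHOLD) -> bool:
    """True when the idle fraction is at or below the paging threshold."""
    return idle_fraction(idle_workers, total_workers) <= threshold


if __name__ == "__main__":
    print(decoded_expr())
    # Hypothetical counts, not taken from the dashboard.
    print(would_page(idle_workers=25, total_workers=100))
    print(would_page(idle_workers=50, total_workers=100))
```

Read this way, the expression is the number of idle workers divided by the number of workers in any state, per cluster and service, so a cluster "hovering around the paging limit" is one where roughly 30% or fewer of its workers are idle.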
[08:12:57] <godog>	 now I'm wondering if we can handle parsoid traffic today with parse1002 not working and parsoid latencies high ?
[08:13:52] <wikibugs>	 (03PS2) 10JMeybohm: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033)
[08:16:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[08:17:54] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[08:21:30] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[08:21:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[08:22:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:23:09] <wikibugs>	 (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-esams [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211)
[08:23:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:23:45] <godog>	 the page is bound to fire again today I think btw
[08:23:56] <godog>	 jbond: ^ FYI
[08:24:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:24:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[08:25:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[08:27:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release wikifunctions/main-evaluator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:27:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[08:28:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:28:19] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42598/console" [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[08:28:26] <godog>	 as expected
[08:28:43] <godog>	 acked
[08:29:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[08:29:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[08:30:19] <godog>	 ok so current options: tweak the threshold for parsoid only, add capacity to parsoid, disable pre-gen as suggested on T342085 (not mutually exclusive)
[08:30:19] <stashbot>	 T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085
[08:30:28] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-esams [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[08:31:40] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940091 (T342211) to esams DC (disable keepalive on port 80)
[08:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:44] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[08:33:11] <jbond>	 hi godog sorry was afk, reading backscroll now
[08:33:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:33:18] <godog>	 hi jbond 
[08:33:59] <godog>	 yeah nothing is on fire immediately btw, I think we should be adding more capacity to parsoid temporarily, and trying to find how to do that
[08:34:23] <jbond>	 ack
[08:36:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:36:41] <godog>	 ok I'm silencing the alert for a while
[08:36:45] <jbond>	 +1
[08:37:33] <godog>	 silenced for 4h
[08:37:52] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019664, @JMeybohm wro...
[08:43:16] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[08:44:47] <hashar>	 jelto: funnily I just thought about that Apache change :)
[08:45:36] <jelto>	 hah :) I'll merge this now and test it, I have httpbb tests ready and integration open in the browser
[08:46:40] <hashar>	 given the config changes are probably not covered by the httpbb suite
[08:51:05] <jelto>	 merged and apache reloaded. Still looks good
[08:53:15] <wikibugs>	 (03PS1) 10Hashar: httpbb: test Gearman testConnection is forbidden [puppet] - 10https://gerrit.wikimedia.org/r/940092 (https://phabricator.wikimedia.org/T219991)
[08:53:20] <hashar>	 jelto: ^ :)
[08:53:34] <hashar>	 that should cover one of the rules
[08:55:23] <wikibugs>	 (03CR) 10Gehel: "Minor comment inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[08:56:27] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/940092 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar)
[08:59:09] <jelto>	 httpbb tests look good, all pass (and one test more in total) :)
[08:59:20] <hashar>	 awesome thank you!
[09:01:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) >>! In T341992#9029300, @RobH wrote: > Cool, I understand now.  I'll move and update netbox/homer for these two hosts tomorrow to move them to 10G configured ports 44/45  I renumbered...
[09:03:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085)
[09:03:56] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: auto link existing users with OIDC [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto)
[09:04:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi)
[09:09:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You need at least to set the "cluster" hiera variable to "parsoid"." [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi)
[09:12:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:13:17] <wikibugs>	 (03PS2) 10Filippo Giunchedi: conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085)
[09:15:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/940095/42599/" [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi)
[09:15:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42600/console" [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi)
[09:15:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001"
[09:17:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001"
[09:17:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:17:44] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:19:40] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=mw1356.eqiad.wmnet
[09:19:48] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=mw1357.eqiad.wmnet
[09:21:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi)
[09:21:55] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto)
[09:22:34] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) p:05Medium→03High
[09:22:50] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) I deployed the change above which adds this two lines to the `gitlab.rb` file:  ` gitlab_rails['omniauth_auto_link_user'] =...
[09:24:08] <wikibugs>	 (03PS12) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826)
[09:24:10] <wikibugs>	 (03PS6) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826)
[09:24:12] <wikibugs>	 (03PS15) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto)
[09:24:14] <wikibugs>	 (03PS7) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[09:24:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001"
[09:25:08] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=mw1356.eqiad.wmnet
[09:25:14] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw1356.eqiad.wmnet
[09:25:56] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=mw1357.eqiad.wmnet
[09:26:01] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw1357.eqiad.wmnet
[09:27:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "This change is ready for review." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[09:28:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[09:29:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Jelto)
[09:30:32] <wikibugs>	 (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[09:35:25] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42601/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[09:36:39] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42603/console" [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[09:36:41] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42602/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto)
[09:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:48:30] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[09:50:17] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001"
[09:50:17] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:54:40] <wikibugs>	 (03PS5) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[09:56:19] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab: move gitlab to test idp" [puppet] - 10https://gerrit.wikimedia.org/r/939345
[10:00:04] <jouncebot>	 mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1000)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1000)
[10:01:25] <wikibugs>	 (03CR) 10Jbond: "thanks response inline" [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[10:10:50] <wikibugs>	 (03PS1) 10Btullis: Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641)
[10:15:24] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42604/console" [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[10:15:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:20:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:23:43] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman)
[10:25:56] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: Bump revertrisk-la to latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman)
[10:26:36] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Bump revertrisk-la to latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman)
[10:29:24] <wikibugs>	 (03PS7) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[10:29:29] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[10:33:06] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:40:44] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:41:36] <wikibugs>	 (03PS59) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[10:45:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:49:52] <wikibugs>	 (03CR) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[10:50:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:53:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:55:03] <wikibugs>	 10SRE-swift-storage: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon)
[11:00:47] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42605/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[11:01:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[11:01:55] <wikibugs>	 (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211)
[11:03:17] <wikibugs>	 (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211)
[11:05:17] <wikibugs>	 (03PS2) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214)
[11:06:48] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42606/console" [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[11:07:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:08:44] <wikibugs>	 (03PS1) 10MVernon: Get rid of swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334)
[11:10:35] <wikibugs>	 (03PS1) 10MVernon: Remove swiftrepl [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334)
[11:11:46] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:14:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe)
[11:14:49] <wikibugs>	 (03PS3) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214)
[11:15:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe)
[11:16:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe)
[11:16:35] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42608/console" [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:16:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[11:17:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "not too familiar with Swift::Swiftrepl but it seems unused and the CR seems like a noop to me" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:17:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42609/console" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:17:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) @andrea.denisse, the new request is here: T342335 Thanks
[11:18:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm (although seems strange to not include this in 940103)" [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:18:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:18:56] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[11:19:35] <wikibugs>	 (03PS1) 10Btullis: Fix the device name when running parted on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151)
[11:20:06] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/939283
[11:21:00] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42610/console" [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[11:21:55] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the device name when running parted on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[11:22:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:28:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:29:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:34:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:35:27] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940101 (T342211) to eqsin DC (disable keepalive on port 80 on A:cp-eqsin)
[11:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:31] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[11:35:40] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[11:39:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:44:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:49:08] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] Get rid of swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:49:35] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] Remove swiftrepl [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon)
[11:50:03] <wikibugs>	 (03CR) 10Joal: "Comments about possibly updating default settings" [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[11:50:52] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[11:50:57] <wikibugs>	 (03PS1) 10Dreamy Jazz: SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468)
[11:51:04] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon)
[11:52:08] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon) 05Open→03Resolved
[11:52:29] <urbanecm>	 jouncebot: nowandnext
[11:52:29] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 7 minute(s)
[11:52:29] <jouncebot>	 In 1 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1300)
[11:52:29] <jouncebot>	 In 1 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1300)
[11:52:52] <urbanecm>	 zabe: are you around to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/940126/ please? or should i go ahead with that?
[11:52:54] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch loopbacks. - cmooney@cumin1001"
[11:53:35] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch loopbacks. - cmooney@cumin1001"
[11:53:35] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:53:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz)
[11:53:42] <urbanecm>	 i guess i can +2 it anyway.
[11:55:51] <wikibugs>	 (03PS1) 10Btullis: Do not attempt to use hdparm on nvme drives for cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151)
[11:57:29] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42611/console" [puppet] - 10https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[11:58:10] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) 05Open→03In progress p:05Triage→03Medium
[11:58:15] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[11:58:52] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) a:05Clement_Goubert→03Joe
[11:59:41] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[12:01:46] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:30] <zabe>	 urbanecm: I can do it I guess
[12:06:31] <urbanecm>	 zabe: feel free to finish it; i can deploy if needed, but I'd prefer someone else doing it. Thank you!
[12:07:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz)
[12:08:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) links moved, servers online for remote os installation.
[12:11:18] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, but I'm not 100% sure about the service catalog change" [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney)
[12:11:32] <wikibugs>	 (03PS2) 10Btullis: Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641)
[12:11:46] <wikibugs>	 (03Merged) 10jenkins-bot: SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz)
[12:11:48] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[12:12:13] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]]
[12:12:15] <wikibugs>	 (03CR) 10Btullis: Enable local caching for presto on the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[12:12:19] <stashbot>	 T340468: Throw an error from UserGroupManager::addUserToGroup if called on a temporary user - https://phabricator.wikimedia.org/T340468
[12:12:19] <stashbot>	 T342322: Unable to use interwiki Special:UserRights - https://phabricator.wikimedia.org/T342322
[12:12:34] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[12:13:38] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:13:49] <logmsgbot>	 !log zabe@deploy1002 zabe and dreamyjazz: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[12:15:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch gw ips. - cmooney@cumin1001"
[12:17:06] <wikibugs>	 (03PS3) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[12:19:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:19:38] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:21] <wikibugs>	 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) All cp hosts in esams and eqsin have keep-alive disabled on port 80.  Drop in number of sessions on port 80:  {F37144516}  {F37144519}  The number of (correctly redirected) requests managed by those h...
[12:20:35] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]] (duration: 08m 22s)
[12:20:40] <stashbot>	 T340468: Throw an error from UserGroupManager::addUserToGroup if called on a temporary user - https://phabricator.wikimedia.org/T340468
[12:20:40] <stashbot>	 T342322: Unable to use interwiki Special:UserRights - https://phabricator.wikimedia.org/T342322
[12:22:07] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch gw ips. - cmooney@cumin1001"
[12:22:07] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:23:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] admin: add Ifrah Khanyaree (WMDE) to LDAP-only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney)
[12:24:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:24:52] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[12:25:03] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[12:25:49] <wikibugs>	 (03PS2) 10Jelto: gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334)
[12:26:08] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[12:27:59] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab_runner: disable unprivileged_userns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto)
[12:30:26] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim Uhmmm. I am sorry for the hassle. [The Phabricator account now shows no authentication factors](https://phabricator.wikimedia.org/p/Cyndymediawiksim/) so it will no...
[12:31:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:33:01] <wikibugs>	 (03PS1) 10Cathal Mooney: Change username to match ldap one [puppet] - 10https://gerrit.wikimedia.org/r/940118 (https://phabricator.wikimedia.org/T341455)
[12:36:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:39:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:41:04] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:41:17] <urbanecm>	 thanks again for the deploy zabe. it seems to work now.
[12:41:24] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Change username to match ldap one [puppet] - 10https://gerrit.wikimedia.org/r/940118 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney)
[12:42:17] <wikibugs>	 (03PS3) 10Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495)
[12:42:22] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: 10Ladsgroup)
[12:43:16] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[12:43:25] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[12:43:47] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[12:43:58] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[12:44:34] <topranks>	 !log LDAP - adding user ifrahkh to groups wmde & nda 
[12:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:45:29] <wikibugs>	 (03PS8) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[12:46:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[12:46:04] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:46:53] <wikibugs>	 (03PS9) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[12:46:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) @Ifrahkhanyaree you've now been added to the required LDAP groups (username ifrahkh).  Please try to access the systems you need and advise if there...
[12:47:02] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[12:47:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:47:29] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) A basic check of the orchestra...
[12:47:43] <wikibugs>	 (03Merged) 10jenkins-bot: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[12:53:02] <wikibugs>	 (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney)
[12:54:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:56:42] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[12:56:52] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[12:59:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @btullis replaced cable on analytics1073 & analytics1075
[13:00:09] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[13:00:20] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[13:04:31] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:06:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:07:14] <wikibugs>	 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur)
[13:08:49] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) That error is from our top-level, las...
[13:09:16] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001"
[13:09:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:10:01] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001"
[13:10:01] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:11:49] <wikibugs>	 (03PS6) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[13:12:42] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[13:12:52] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[13:13:26] <wikibugs>	 (03PS1) 10Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230)
[13:13:54] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) p:05Triage→03Medium
[13:14:04] <wikibugs>	 (03CR) 10Vgutierrez: Remove references to releases1002/releases2002 for decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney)
[13:14:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[13:16:45] <wikibugs>	 (03PS4) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[13:16:47] <wikibugs>	 (03PS2) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214)
[13:17:10] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[13:17:21] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[13:18:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:18:16] <logmsgbot>	 !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[13:18:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:21:42] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[13:21:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001"
[13:22:41] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001"
[13:22:41] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:30] <wikibugs>	 (03PS2) 10Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230)
[13:24:36] <wikibugs>	 (03PS1) 10JMeybohm: wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314)
[13:25:34] <wikibugs>	 (03CR) 10JMeybohm: "Feel free to deploy anytime" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[13:31:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:32:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:34:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[13:35:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[13:35:30] <wikibugs>	 (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211)
[13:37:33] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar)
[13:37:53] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42612/console" [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[13:39:43] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) 05Declined→03Open To upgrade the OS on CI s...
[13:43:42] <wikibugs>	 (03PS1) 10JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785)
[13:44:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[13:44:36] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm (this time with actually clicking on +1)" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney)
[13:44:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:14] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[13:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:46:37] <wikibugs>	 (03PS2) 10Btullis: Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716)
[13:47:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:48:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:49:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) (owner: 10Btullis)
[13:49:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:51:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:52:09] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] Run LDAP group sync periodically on gitlab replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy)
[13:52:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:53:08] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[13:53:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:54:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:54:11] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[13:54:13] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[13:55:46] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:55:49] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[13:56:02] <wikibugs>	 (PS2) JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785)
[13:57:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[13:57:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[13:57:29] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[13:58:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:58:25] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] Run LDAP group sync periodically on gitlab replicas [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: Ahmon Dancy)
[13:58:50] <wikibugs>	 (CR) CI reject: [V: -1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: JMeybohm)
[13:59:19] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (jbond) > typing go completes the installation correctly correction typing go and then doing a reboot via install_console allows things to complete
[13:59:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:59:53] <wikibugs>	 (PS7) Jelto: Run LDAP group sync periodically on gitlab replicas [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: Ahmon Dancy)
[14:01:26] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[14:01:38] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[14:03:45] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42613/console" [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: Ahmon Dancy)
[14:04:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:04:04] <wikibugs>	 (PS3) Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230)
[14:04:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:04:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:05:19] <wikibugs>	 (PS3) JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785)
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:01] <wikibugs>	 (PS5) Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/939738
[14:12:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:13:28] <sukhe>	 !log dns1004 upgrade to pdns-rec 4.8.4: T341611
[14:13:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:31] <stashbot>	 T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611
[14:13:33] <wikibugs>	 (Abandoned) Jforrester: [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[14:13:58] <wikibugs>	 (CR) Jforrester: [C: +1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: JMeybohm)
[14:14:59] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[14:15:10] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[14:15:35] <wikibugs>	 (PS1) Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157
[14:15:40] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:12] <wikibugs>	 (Abandoned) Ssingh: Revert "depool esams: router migration" [dns] - https://gerrit.wikimedia.org/r/938678 (owner: Ssingh)
[14:17:37] <wikibugs>	 (PS2) Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346)
[14:17:42] <wikibugs>	 (CR) Hashar: "That is probably not a high priority but that makes it easier to review differences when refreshing dependencies of an otherwise untouched" [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[14:18:44] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Do not attempt to use hdparm on nvme drives for cephosd servers [puppet] - https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[14:21:11] <wikibugs>	 (PS1) Bking: netboot.cfg: explicitly define partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705)
[14:21:47] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] dnsrecursor: allow configuring the webserver loglevel [puppet] - https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[14:22:04] <wikibugs>	 (PS3) Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346)
[14:22:45] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] Run LDAP group sync periodically on gitlab replicas [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: Ahmon Dancy)
[14:25:35] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (jbond) I have rebuilt sretest1002 again and things now work.  I have built things a few times (logs in cumin1001:~cookbook-testing/logs) and so far it seems a bit random when...
[14:25:53] <wikibugs>	 (PS1) Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346)
[14:26:48] <wikibugs>	 (CR) Ssingh: [C: +1] "I am sure you know but you can run agent on A:installserver to make sure that the changes are picked up immediately!" [puppet] - https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[14:27:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:27:35] <wikibugs>	 (CR) Bking: [C: +2] netboot.cfg: explicitly define partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[14:29:59] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] dnsrecursor: allow configuring the webserver loglevel [puppet] - https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[14:30:14] <wikibugs>	 (PS4) EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435)
[14:30:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[14:30:41] <sukhe>	 !log disable puppet on A:dns-rec to slowly roll out CR 937991
[14:30:41] <wikibugs>	 (PS4) JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785)
[14:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:07] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1075.eqiad.wmnet with reason: host reimage
[14:32:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:32:43] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:32:49] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:33:06] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:18] <sukhe>	 ^ expected
[14:33:30] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:53] <wikibugs>	 (PS1) Jelto: gitlab: make sure ldap_group_sync_user is created first [puppet] - https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211)
[14:34:11] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1075.eqiad.wmnet with reason: host reimage
[14:34:32] <wikibugs>	 (CR) Herron: [C: +2] service::catalog: add prometheus-https [puppet] - https://gerrit.wikimedia.org/r/939326 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[14:36:13] <sukhe>	 !log run agent on cumin -b1 -s30 'A:dns-rec and not P{dns4004*}' 
[14:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:51] <wikibugs>	 SRE, ops-eqiad, Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (RobH) a:RobH→Vgutierrez Ready for installation!
[14:37:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:38:03] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:38:05] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:38:11] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:38:58] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:39:04] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:39:07] <wikibugs>	 (CR) JHathaway: puppetserver: do not notify puppetserver service on changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:40:31] <wikibugs>	 (CR) Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: Krinkle)
[14:41:13] <logmsgbot>	 !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:41:16] <logmsgbot>	 !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:42:34] <wikibugs>	 (CR) Hashar: "Eventually I went to implement it using PIP_FIND_LINKS:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: Hashar)
[14:44:18] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[14:45:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts analytics1073.eqiad.wmnet
[14:45:41] <herron>	 !log roll restart codfw/eqiad low-traffic pybals to add prometheus-https T326657
[14:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:45] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[14:45:48] <wikibugs>	 (CR) Cory Massaro: [C: +2] wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[14:45:54] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts analytics1073.eqiad.wmnet
[14:46:05] <wikibugs>	 SRE, User-aborrero, cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (aborrero) a:Jclark-ctr→aborrero the DC-ops part is done.
[14:46:38] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[14:47:22] <wikibugs>	 (CR) Cory Massaro: "Thank you! I'd like to run a quick test with these AppArmor profiles, one moment." [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: JMeybohm)
[14:47:41] <wikibugs>	 (CR) Hashar: "That one should be straightforward.  zuul-gearman.py is a script I wrote ages ago to inspect the Gearman server.  I have found out there i" [puppet] - https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) (owner: Hashar)
[14:48:48] <wikibugs>	 (CR) Jbond: puppetserver: do not notify puppetserver service on changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[14:50:08] <wikibugs>	 (CR) Vgutierrez: [C: +1] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:50:22] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1073.eqiad.wmnet with OS bullseye
[14:50:29] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[14:51:09] <wikibugs>	 (CR) Ssingh: [C: +1] "see vg's nit but looks good!" [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:51:13] <wikibugs>	 (CR) Cory Massaro: [C: +1] "These profiles work in my testing, so I'm happy to approve!" [puppet] - https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: JMeybohm)
[14:51:17] <wikibugs>	 (PS1) Cwhite: logstash: move labels.trace to error.stack_trace [puppet] - https://gerrit.wikimedia.org/r/939285 (https://phabricator.wikimedia.org/T339137)
[14:51:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1003.eqiad.wmnet with reason: host reimage
[14:52:13] <wikibugs>	 (PS2) Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211)
[14:54:28] <wikibugs>	 (CR) Ssingh: [C: +1] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:54:30] <wikibugs>	 (CR) EoghanGaffney: [C: +2] Remove references to releases1002/releases2002 for decom [puppet] - https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[14:55:12] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1003.eqiad.wmnet with reason: host reimage
[14:56:13] <logmsgbot>	 !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:56:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (Jclark-ctr) Reopening ticket with dell
[14:56:46] <logmsgbot>	 !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:57:13] <wikibugs>	 (CR) Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:57:30] <wikibugs>	 (PS1) Jelto: gitlab: add ldap sync token [labs/private] - https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211)
[14:57:33] <wikibugs>	 (CR) Fabfur: [C: +2] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:58:32] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab: add ldap sync token [labs/private] - https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[14:58:36] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940150 (T342211) to ulsfo DC (disable keepalive on port 80 on A:cp-ulsfo)
[14:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:39] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[14:58:44] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1075.eqiad.wmnet with OS bullseye
[14:59:10] <wikibugs>	 SRE, Traffic, Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[14:59:52] <wikibugs>	 (CR) Jelto: [V: +2 C: +2] gitlab: add ldap sync token [labs/private] - https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[15:02:23] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) Open→Resolved Many thanks to all concerned. These hosts now have regained connectivity and have been upgraded to 10 G...
[15:02:53] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab: make sure ldap_group_sync_user is created first (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[15:02:58] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:47] <wikibugs>	 (PS1) Btullis: Stop repeatedly disabling the write cache on cephosd servers [puppet] - https://gerrit.wikimedia.org/r/940178 (https://phabricator.wikimedia.org/T330151)
[15:04:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:05:03] <wikibugs>	 (CR) Btullis: [C: +2] Stop repeatedly disabling the write cache on cephosd servers [puppet] - https://gerrit.wikimedia.org/r/940178 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[15:05:10] <wikibugs>	 (PS1) Jelto: gitlab: move gitlab::ldap_group_sync_bot_token to private puppet [puppet] - https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211)
[15:05:12] <wikibugs>	 (PS6) Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/939738
[15:05:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (andrea.denisse) a:andrea.denisse
[15:05:38] <wikibugs>	 (PS1) Bking: wdqs: fix missing entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314)
[15:05:48] <wikibugs>	 (PS1) Cathal Mooney: Add static yaml data for new eqiad leaf devices [homer/public] - https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230)
[15:06:47] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bullseye
[15:06:50] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42615/console" [puppet] - https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[15:06:58] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[15:07:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[15:07:32] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[15:07:52] <wikibugs>	 (CR) Jelto: [C: +2] gitlab: make sure ldap_group_sync_user is created first [puppet] - https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[15:08:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage
[15:08:19] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: move gitlab::ldap_group_sync_bot_token to private puppet [puppet] - https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211) (owner: Jelto)
[15:11:15] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage
[15:12:25] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:12:51] <wikibugs>	 (CR) DCausse: [C: +1] wdqs: fix missing entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314) (owner: Bking)
[15:13:14] <wikibugs>	 (CR) Bking: [C: +2] wdqs: fix missing entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314) (owner: Bking)
[15:14:51] <wikibugs>	 SRE, Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[15:15:49] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1003.eqiad.wmnet with OS bookworm
[15:16:16] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm completed:...
[15:17:56] <wikibugs>	 (CR) JMeybohm: [C: +1] mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[15:18:00] <wikibugs>	 (CR) JMeybohm: [C: +1] mw-api-int: increase namespace limits [deployment-charts] - https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[15:18:15] <wikibugs>	 (PS1) Giuseppe Lavagetto: kubernetes: add mw-misc "service" [puppet] - https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859)
[15:19:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:20:24] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[15:20:35] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[15:28:13] <icinga-wm>	 PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:19] <hashar>	 ^^  rsync: failed to connect to releases1003.eqiad.wmnet: Connection timed out (110)
[15:30:28] <hashar>	 it runs every 10 minutes so I guess that will self recover
[15:30:38] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:15] <elukey>	 !log stop kafka main eqiad maintenance - T341558
[15:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:20] <stashbot>	 T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558
[15:31:46] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356)
[15:34:31] <wikibugs>	 (PS1) Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211)
[15:36:40] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42616/console" [puppet] - https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[15:45:00] <wikibugs>	 (PS7) Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - https://gerrit.wikimedia.org/r/939738
[15:46:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[15:46:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:46:33] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[15:46:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:48:04] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[15:48:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[15:48:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[15:49:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[15:50:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:50:01] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add static yaml data for new eqiad leaf devices [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[15:51:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[15:51:01] <wikibugs>	 (03PS1) 10Btullis: Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151)
[15:51:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:51:59] <wikibugs>	 (03PS2) 10Btullis: Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151)
[15:54:06] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) I wonder if this is because I'm running this over and over again in quick succession and possibly fingerprint checking is getting in the way, as I get the following from...
[15:59:59] <icinga-wm>	 PROBLEM - Host elastic2086 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:17] <icinga-wm>	 RECOVERY - Host elastic2086 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms
[16:00:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol1005: add role with new domain [puppet] - 10https://gerrit.wikimedia.org/r/940194 (https://phabricator.wikimedia.org/T341495)
[16:00:57] <icinga-wm>	 RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:14] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[16:02:17] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[16:03:32] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
[16:03:57] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcontrol1005
[16:04:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:27] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[16:07:14] <icinga-wm>	 PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:16] <icinga-wm>	 RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:11] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[16:13:01] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye
[16:13:29] <jinxer-wm>	 (ProbeDown) firing: Service vrts2001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts2001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:54] <icinga-wm>	 PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[16:15:46] <icinga-wm>	 RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage
[16:18:16] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[16:18:33] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage
[16:20:20] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005 - aborrero@cumin1001"
[16:21:02] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005 - aborrero@cumin1001"
[16:21:02] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:33] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye
[16:21:46] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[16:21:50] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005
[16:22:13] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005
[16:25:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol1005: add role with new domain [puppet] - 10https://gerrit.wikimedia.org/r/940194 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez)
[16:27:14] <icinga-wm>	 PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:36] <icinga-wm>	 PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:29] <wikibugs>	 (03PS1) 10Cory Massaro: Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196
[16:30:02] <icinga-wm>	 RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:08] <icinga-wm>	 RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudcontrol1005: load cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/940197 (https://phabricator.wikimedia.org/T341495)
[16:30:20] <wikibugs>	 (03PS2) 10Cory Massaro: Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196
[16:31:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] eqiad1: cloudcontrol1005: load cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/940197 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez)
[16:31:47] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye
[16:34:32] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859)
[16:34:34] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/940199
[16:36:10] <wikibugs>	 (03PS8) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[16:36:12] <wikibugs>	 (03PS3) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214)
[16:37:29] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1005.eqiad.wmnet with OS bullseye
[16:37:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye
[16:38:22] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[16:38:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[16:40:55] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[16:41:01] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1013.eqiad.wmnet with OS bookworm
[16:42:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1001.eqiad.wmnet with OS bookworm
[16:42:57] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm
[16:43:01] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm
[16:44:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10nskaggs) Taavi has been a valued contributor to Wikimedia and its projects for over 4 years, and is the current Tech contributor of the year for Wikimedia. He possesses both the requisite skills and knowl...
[16:46:49] <wikibugs>	 (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle)
[16:47:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[16:47:56] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1073.eqiad.wmnet with OS bullseye
[16:48:02] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[16:48:58] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940190 (T342211) to codfw DC (disable keepalive on port 80 on A:cp-codfw)
[16:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:02] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[16:49:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts analytics1075.eqiad.wmnet
[16:49:48] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts analytics1075.eqiad.wmnet
[16:49:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur)
[16:51:48] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage
[16:52:15] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[16:52:26] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[16:53:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[16:54:47] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10BTullis) I don't know if you want a new ticket for this, but I'm seeing errors from the `upgrade-firmware` cookbook when run against some hosts with the old...
[16:55:12] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage
[16:56:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1001.eqiad.wmnet with reason: host reimage
[16:57:41] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage
[16:59:51] <wikibugs>	 (03Abandoned) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[17:00:06] <jouncebot>	 bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1700)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1700)
[17:00:14] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1001.eqiad.wmnet with reason: host reimage
[17:03:17] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/940201
[17:03:28] <wikibugs>	 (03Restored) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[17:05:10] <wikibugs>	 (03PS1) 10Dwisehaupt: Remove frav1002 monitoring, add it for frav1003 [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064)
[17:07:22] <wikibugs>	 (03PS9) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[17:07:33] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Remove frav1002 monitoring, add it for frav1003 [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt)
[17:08:34] <wikibugs>	 (03CR) 10Jgreen: [C: 03+1] "Looks good to me, ready to merge!" [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt)
[17:09:02] <wikibugs>	 (03PS4) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214)
[17:09:50] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[17:09:55] * bd808 will not be trying to deploy anything today
[17:10:02] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[17:12:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[17:18:04] <icinga-wm>	 PROBLEM - Zookeeper Server on flink-zk1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[17:20:39] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10Papaul) @BTullis the firmware cookbook will fail if the IDRAC version is too old. You need to get it to a minimum version for the API to work.
[17:21:03] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[17:21:08] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bookworm
[17:21:50] <wikibugs>	 (03PS1) 10Jbond: ssh::publich_fingrprints: also link .ecdsa  file [puppet] - 10https://gerrit.wikimedia.org/r/940227
[17:22:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ssh::publich_fingrprints: also link .ecdsa  file [puppet] - 10https://gerrit.wikimedia.org/r/940227 (owner: 10Jbond)
[17:22:15] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1001.eqiad.wmnet with OS bookworm
[17:22:21] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm completed:...
[17:22:53] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[17:23:09] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH)
[17:23:17] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH)
[17:25:18] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[17:25:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:25:30] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:28:44] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10RobH)
[17:28:52] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10RobH)
[17:36:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1002.eqiad.wmnet with reason: host reimage
[17:39:46] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1002.eqiad.wmnet with reason: host reimage
[17:41:05] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye
[17:41:16] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[17:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:55:52] <wikibugs>	 (03CR) 10Jforrester: "No change to image value?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro)
[18:00:04] <jouncebot>	 dancy and dduvall: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1800).
[18:00:19] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1002.eqiad.wmnet with OS bookworm
[18:00:25] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm completed:...
[18:05:38] <wikibugs>	 (03CR) 10Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle)
[18:08:06] <wikibugs>	 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) `pdns-recursor 4.8.4-1+wmf11u1` has been running in production on the following hosts for a while:  dns1004, 2004, 4003, 5003 doh6001  No issues observed, so we will be rolling out to all hosts that use it on...
[18:08:26] <wikibugs>	 (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron)
[18:11:21] <wikibugs>	 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[18:11:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Papaul finished double checking that I got everything like we discussed. All the firmware is up to date and the NIC issues have been solved. Can you pleas...
[18:15:08] <dancy>	 The train is blocked on https://phabricator.wikimedia.org/T342282.  Please help to unblock if you can!
[18:37:13] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246)
[18:37:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:37:15] <dancy>	 Unblocked!
[18:37:55] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:38:02] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:38:05] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[18:38:11] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:41:06] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes,set/weight=10; selector: name=wdqs2013-19.codfw.wmnet
[18:42:46] <wikibugs>	 (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle)
[18:43:11] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2013.codfw.wmnet
[18:43:11] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2014.codfw.wmnet
[18:43:12] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2015.codfw.wmnet
[18:43:19] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2016.codfw.wmnet
[18:43:28] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2017.codfw.wmnet
[18:43:35] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2018.codfw.wmnet
[18:43:38] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2019.codfw.wmnet
[18:43:52] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs20{20}.codfw.wmnet
[18:44:33] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2020.codfw.wmnet
[18:44:42] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.18  refs T340246
[18:44:46] <stashbot>	 T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246
[18:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:46:52] <rzl>	 inflatador: fyi in case it makes your life easier next time, it takes a regex, so wdqs20(1[3-9]|20)\.codfw\.wmnet :)
[18:49:19] <inflatador>	 rzl thanks, I made a few feeble attempts using cumin syntax to no avail ;)
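A minimal sketch of the regex selector rzl suggests above, checked in plain Python rather than through conftool itself. The candidate host list is hypothetical, and the use of `re.fullmatch` is an assumption made for illustration; conftool's own matching semantics may differ.

```python
import re

# Pattern quoted from rzl's message: wdqs2013 through wdqs2019, plus wdqs2020.
SELECTOR = re.compile(r"wdqs20(1[3-9]|20)\.codfw\.wmnet")

# Hypothetical candidate hosts, for illustration only.
hosts = [f"wdqs{n}.codfw.wmnet" for n in range(2010, 2023)]

# Assumption: the whole hostname must match, hence fullmatch.
matched = [h for h in hosts if SELECTOR.fullmatch(h)]

for host in matched:
    print(host)
```

A single pattern like this covers the eight hosts that were pooled one at a time in the log above, while leaving neighbours such as wdqs2012 and wdqs2021 untouched.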
[18:50:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:51:28] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314)
[18:52:26] <wikibugs>	 (03CR) 10Bking: [C: 03+1] wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper)
[18:52:30] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper)
[18:55:20] <icinga-wm>	 PROBLEM - glance-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:55:42] <icinga-wm>	 PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:56:26] <icinga-wm>	 PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:00:50] <icinga-wm>	 PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[19:05:34] <icinga-wm>	 ACKNOWLEDGEMENT - cinder-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:05:34] <icinga-wm>	 ACKNOWLEDGEMENT - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:05:35] <icinga-wm>	 ACKNOWLEDGEMENT - glance-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:32:28] <wikibugs>	 (PS1) Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792)
[19:32:51] <wikibugs>	 (CR) CI reject: [V: -1] flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[19:33:36] <wikibugs>	 (PS1) Jforrester: apache: Add 'view_urls' rewrite for /view URLs, enable on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190)
[19:33:38] <wikibugs>	 (PS1) Jforrester: apache: Enable view_urls on wikifunctions.org [puppet] - https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190)
[19:34:06] <wikibugs>	 (PS2) Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792)
[19:35:06] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[19:43:07] <wikibugs>	 (PS3) Cory Massaro: Redeploy with new version of function-ochestrator. [deployment-charts] - https://gerrit.wikimedia.org/r/940196
[19:43:33] <wikibugs>	 (CR) Cory Massaro: Redeploy with new version of function-ochestrator. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/940196 (owner: Cory Massaro)
[19:45:20] <wikibugs>	 (PS2) Jforrester: apache: Add 'view_urls' rewrite for /view URLs, enable on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190)
[19:45:22] <wikibugs>	 (PS2) Jforrester: apache: Enable view_urls on wikifunctions.org [puppet] - https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190)
[20:00:05] <jouncebot>	 brennen and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T2000).
[20:00:21] <TheresNoTime>	 nothing to deploy
[20:17:51] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:25:08] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:25:12] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:50:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Papaul) @Jhancock.wm that step was already done on june15 see link below. so you should be good to proceed with the OS install. Thanks  https://gerrit.wikimedia.org/r/c...
[20:53:41] <wikibugs>	 (CR) Jforrester: [C: +1] Redeploy with new version of function-ochestrator. [deployment-charts] - https://gerrit.wikimedia.org/r/940196 (owner: Cory Massaro)
[21:06:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1156.eqiad.wmnet with OS bullseye
[21:07:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye
[21:10:35] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:11:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:25] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:17:16] <wikibugs>	 (CR) Jdlrobson: [C: +1] "Jan: can you backport this before you head out?" [mediawiki-config] - https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: Mabualruz)
[21:18:16] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@0e476e5]: Tweak Zuul status page css 🥚
[21:18:23] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@0e476e5]: Tweak Zuul status page css 🥚 (duration: 00m 07s)
[21:30:47] <wikibugs>	 (CR) Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: Krinkle)
[21:34:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1156.eqiad.wmnet with reason: host reimage
[21:38:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1156.eqiad.wmnet with reason: host reimage
[21:41:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[21:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:54:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:55:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[21:55:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1156.eqiad.wmnet with OS bullseye
[21:55:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye completed: - an-worker1156 (*...
[21:56:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[22:00:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1155.eqiad.wmnet with OS bullseye
[22:00:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye
[22:07:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:11:05] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:29:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1155.eqiad.wmnet with reason: host reimage
[22:32:31] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1155.eqiad.wmnet with reason: host reimage
[22:47:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:47:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:48:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1155.eqiad.wmnet with OS bullseye
[22:48:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye completed: - an-worker1155 (*...
[22:51:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[22:54:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1154.eqiad.wmnet with OS bullseye
[22:55:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye
[23:13:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1154.eqiad.wmnet with reason: host reimage
[23:16:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1154.eqiad.wmnet with reason: host reimage
[23:23:53] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: remove grafana log cloning [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:31:39] <Daimona>	 Hi folks, it looks like a patch to PrivateSettings.php got lost or something, and is now causing production errors: https://phabricator.wikimedia.org/T342405 I have no idea how this works though.
[23:32:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:33:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:33:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1154.eqiad.wmnet with OS bullseye
[23:33:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye completed: - an-worker1154 (*...
[23:35:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (andrea.denisse) Open→In progress a:andrea.denisse
[23:38:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[23:41:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1153.eqiad.wmnet with OS bullseye
[23:42:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye
[23:44:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[23:49:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[23:54:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[23:59:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1153.eqiad.wmnet with reason: host reimage