[00:18:46] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:18:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:54] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939279 [00:38:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939279 (owner: 10TrainBranchBot) [00:52:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939279 (owner: 10TrainBranchBot) [01:08:33] (03CR) 10BryanDavis: [C: 04-1] "I think that Taavi's fix at Ieb941d4b159a8dd5dfc329cf678af97d5ec85bc0 has eliminated the need for this complexity. 
He tested things agains" [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [01:09:02] (03CR) 10Tim Starling: [C: 03+2] Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099) (owner: 10Jforrester) [01:10:12] (03CR) 10Tim Starling: [C: 03+2] Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099) (owner: 10Jforrester) [01:13:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:24:37] (03Merged) 10jenkins-bot: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099) (owner: 10Jforrester) [01:25:35] (03Merged) 10jenkins-bot: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099) (owner: 10Jforrester) [01:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: 
eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:35:33] !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.17/includes/diff/DifferenceEngine.php: fix prod error T342099, T341961 (duration: 09m 20s) [01:35:39] T341961: UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T341961 [01:35:39] T342099: PHP Warning: Illegal offset type in LanguageFactory::getLanguage(StubUserLang); subsequently UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. 
- https://phabricator.wikimedia.org/T342099 [01:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:46:20] !log tstarling@deploy1002 Synchronized php-1.41.0-wmf.18/includes/diff/DifferenceEngine.php: fix prod error T342099, T341961 (duration: 08m 32s) [01:46:25] T341961: UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. 
- https://phabricator.wikimedia.org/T341961 [01:46:25] T342099: PHP Warning: Illegal offset type in LanguageFactory::getLanguage(StubUserLang); subsequently UnexpectedValueException: MapCacheLRU::has: invalid key; must be string or integer. - https://phabricator.wikimedia.org/T342099 [01:52:49] (03CR) 10Milimetric: [C: 03+1] Create puppet scripting for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) (owner: 10David Martin) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:00] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:19:12] (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [04:32:31] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10Joe) 05Open→03Resolved Hi @TheDJ, what you're seeing there is a big influx of 429 from our systems rate-limiting 
some very aggressive api user from a public cloud. To put this in prespective, we... [04:33:33] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=parse1002.* [04:37:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Joe) 05Resolved→03Open The server went down twice in one day yesterday, see T342298. So you can sadly uncross your fingers, @akosiaris :( [04:38:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Joe) The host is again set to inactive, and still not powercycled so that any further debugging can be performed, if needed. [04:46:41] (03CR) 10Kaleem Bhatti: [C: 03+1] "anyone please merge this I don't know else I can" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [05:35:40] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:46:15] (03PS2) 10Sohom Datta: Enable EditInSequence in pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) [05:50:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42597/console" [puppet] - 
10https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [05:50:34] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] services_proxy: add mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [05:55:35] (03PS3) 10Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) [05:56:29] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0600). [06:29:24] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:37:16] !log start kafka main eqiad maintenance (partitions rebalancing) - T341558 [06:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:20] T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [06:46:35] (03CR) 10Elukey: [C: 03+1] "Sorry didn't think about d-prep! 
We could also force TLS in there, IIRC we have PKI available, but this change is good to avoid puppet bei" [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [06:51:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:52:20] the under replicated partitions is due to the rebalance work [06:56:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:57:15] (03CR) 10JMeybohm: [C: 03+1] sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [07:00:05] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T0700). [07:00:05] _joe_ and Sohom_Datta: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] morning! we have three patches scheduled for deployment, no trainees today though two people are waiting to be scheduled. _joe_ it looks like the two patches you have on the calendar are intended to be dpeloyed in that order, correct? 
[07:00:33] o/ [07:00:51] Sohom_Datta: your patch will go third. there shuold be plenty of time however. [07:00:52] <_joe_> apergos: please go on with Sohom_Datta first :) and yes [07:01:05] oh. well never mind, Sohom_Datta your patch willgo first :-D [07:01:08] <_joe_> apergos: oh ok, I am usually ok being left last :) [07:01:12] Yeah sure [07:01:15] Ah okay :) [07:01:20] <_joe_> because in case I can stretch the window :) [07:01:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) (owner: 10Sohom Datta) [07:01:44] proceeding [07:02:19] (03Merged) 10jenkins-bot: Enable EditInSequence in pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) (owner: 10Sohom Datta) [07:02:23] there is now window schheduled after this one in the next hour, so we can go over time if needed. [07:02:55] !log ariel@deploy1002 Started scap: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]] [07:03:30] *is no [07:04:32] !log ariel@deploy1002 ariel and soda: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:04:59] Sohom_Datta: please test your change on mwdebug1002 [07:05:13] on it :) [07:06:04] (03CR) 10JMeybohm: [C: 04-1] modules: Add a new networkpolicy for base modules (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [07:06:49] Tested, looks good to me [07:07:19] continuing [07:11:50] (03PS2) 10Giuseppe Lavagetto: mw-api-int: increase namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939716 
(https://phabricator.wikimedia.org/T342252) [07:11:52] (03PS3) 10Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) [07:11:54] (03PS4) 10Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) [07:12:48] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:939392|Enable EditInSequence in pawikisource]] (duration: 09m 52s) [07:12:52] Sohom_Datta: please test your change in production. [07:13:43] Looks good, Thanks a lot for deploying :) [07:13:55] sure thing! [07:14:17] _joe_: shall I proceed with the first of your patches? [07:14:30] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:14:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:14:47] <_joe_> apergos: yes, once it's on mwdebug, I'll need a longer testing window for this one, sorry [07:14:54] <_joe_> if you prefer I can do it myself [07:14:57] that's fine, take your time [07:14:58] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:15:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:15:18] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:15:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] CI: TestOutcome for diffs requires stdout to not be empty 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:15:32] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:15:48] I should have asked if you preferred to self-deploy. too late now, heh! [07:15:51] (03Merged) 10jenkins-bot: noc: add script to dump etcd db config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:16:16] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:16:19] !log ariel@deploy1002 Started scap: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]] [07:16:23] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [07:17:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:51] !log ariel@deploy1002 oblivian and ariel: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:17:56] _joe_: plese test on mwdebug1002 and let me know when testing is complete. 
[07:18:55] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:18:58] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:19:00] (03CR) 10JMeybohm: [C: 03+2] CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - 10https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:20:07] <_joe_> apergos: please proceed [07:20:20] continuing [07:21:26] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:22:20] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:04] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:36] RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:24:22] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:24:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:25:09] <_joe_> apergos: you can move to the next patch immediately after. that's the one that might need a revert [07:25:33] no testing of the first patch in production, you mean? 
_joe_ [07:25:49] <_joe_> apergos: it was already done, I just scap pull the code to mwmaint :D [07:25:55] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:938644|noc: add script to dump etcd db config (T341859)]] (duration: 09m 35s) [07:25:57] tsk tsk! [07:25:59] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [07:25:59] <_joe_> which is where noc.w.org works [07:26:02] (03Merged) 10jenkins-bot: CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - 10https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:26:03] all right, moving to the next patch [07:26:05] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:26:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:27:07] (03Merged) 10jenkins-bot: noc/db.php: use the new etcd fetch function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:27:13] (03Merged) 10jenkins-bot: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:27:29] (03CR) 10JMeybohm: [C: 03+1] Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:27:33] !log ariel@deploy1002 Started scap: Backport for [[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]] [07:29:05] !log ariel@deploy1002 oblivian and ariel: Backport for 
[[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:29:09] _joe_ the moment of doom has arrived, please test your changes on mwdebug1002 (if possible) [07:29:22] <_joe_> apergos: you said mwmaint1002, right? [07:29:38] I did not; you'll have to pull it there to do so [07:29:48] <_joe_> yeah :) [07:29:55] I'll wait :-P [07:29:58] <_joe_> I was just being a smartass :P [07:30:05] nothing new there :-P [07:30:38] (welcome to the ariel and _joe_ show, a special edition of the backprot-and-training window on this fine UTC morning) [07:31:02] 🍿 [07:31:13] <_joe_> apergos: go on please :) [07:31:24] continuing [07:31:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] noc: stop using script to populate database data URIs [puppet] - 10https://gerrit.wikimedia.org/r/938818 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [07:36:47] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:938645|noc/db.php: use the new etcd fetch function (T341859)]] (duration: 09m 14s) [07:36:51] _joe_: your patch is now live in production, please do any additional testing that is needed. [07:36:51] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [07:37:04] <_joe_> apergos: thanks, all done :) [07:37:09] ok! [07:37:27] everyone here gets their remaining 23 minutes back, the window is concluded [07:37:45] !log UTC morning backport and config training window complete [07:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:47] <_joe_> if you can see https://noc.wikimedia.org/db.php my change was successful :) [07:38:11] you already know my firefox is acting up, sure you want to risk that test? 
:-P [07:38:19] Notice: Undefined variable: dbctlJsonByDC in /srv/mediawiki/docroot/noc/db.php on line 136 [07:38:23] Warning: Invalid argument supplied for foreach() in /srv/mediawiki/docroot/noc/db.php on line 136 [07:38:24] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:37] Notice: Undefined variable: dbConfigEtcdJsonFilename in /srv/mediawiki/docroot/noc/db.php on line 144 [07:38:37] and on [07:38:46] these seem not good, _joe_ [07:38:58] <_joe_> no it' [07:39:11] <_joe_> oh [07:39:18] <_joe_> yeah I must've missed that [07:39:21] <_joe_> the page works though :D [07:39:25] lol [07:39:40] the window can go for another 20 inutes [07:39:43] *minutes [07:39:52] shall I re-open? what would you prefer? [07:39:52] <_joe_> yeah I'll fix it and merge as soon as I'm done [07:39:58] ok [07:40:05] <_joe_> don't worry, I'll use my root privilege there [07:40:20] <_joe_> where did you see those log messages? 
[07:40:22] !log UTC morning backport and config training window reopened for fix to the last noc patch [07:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:32] i see them right on the page [07:40:45] I went to the link you so kindly provided and they were at the top [07:40:53] <_joe_> sigh they didn't appear on my test [07:41:10] Notice: Undefined variable: dbConfigEtcdJsonFilename in /srv/mediawiki/docroot/noc/db.php on line 145 one more message [07:41:13] so that's all 4 [07:41:39] (03PS1) 10JMeybohm: function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314) [07:42:03] when things are fixed up again please let me know and I'll close the window again [07:42:21] and if you need anything please ping, I'll be watching here of course [07:45:01] (03CR) 10JMeybohm: [C: 03+2] function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:45:57] (03Merged) 10jenkins-bot: function-orchestrator: Fix service name and port for function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940087 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [07:46:11] (03PS1) 10Giuseppe Lavagetto: noc: fix other references to old files. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940089 [07:46:19] <_joe_> apergos: ^^ [07:46:35] self-deploy right? (I'm happy to do it though) [07:47:08] _joe_: [07:47:17] <_joe_> apergos: sure [07:47:32] ok! please add the patch to the dpeloyment calendar too for the record [07:47:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] noc: fix other references to old files. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/940089 (owner: 10Giuseppe Lavagetto) [07:48:18] <_joe_> yeah [07:48:52] (03Merged) 10jenkins-bot: noc: fix other references to old files. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940089 (owner: 10Giuseppe Lavagetto) [07:50:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:41] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:50:59] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:51:24] <_joe_> apergos: is the thing fixed for you too? [07:51:35] lookng [07:52:05] the errors on the page are gone [07:52:33] <_joe_> ack [07:53:03] <_joe_> I am not going to sync-file this, as there is no reason to cause another global restart of php-fpm [07:53:16] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:50] I wouldn't worry about it [07:54:01] the window is for up to 6 patches each with their wn php fpm restart :-P [07:54:07] *own [07:56:55] !log UTC morning backport and config training window really complete [07:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:08] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi) [07:57:24] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:15] (03PS4) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) [08:02:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:02:34] (HelmReleaseBadStatus) firing: Helm release wikifunctions/main-evaluator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:02:41] checking [08:03:36] mmhh the fpm workers for parsoid have been creeping up to the limit, from the dashboard link [08:03:56] I'm looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-2d&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus%2Fops&viewPanel=64&refresh=1m [08:04:20] I'm guessing this is the issue with parsoid "slow" since last sat, what do you think _joe_ ? [08:04:44] T342085 that is [08:04:45] T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 [08:07:24] !incidents [08:07:24] You're not allowed to perform this action. 
[08:07:32] sigh I forgot to ack [08:07:46] * Emperor here [08:08:03] <_joe_> godog: probably, yes [08:08:06] ah, I see [08:08:28] <_joe_> godog: we're also down another server [08:08:47] ah that would contribute for sure, thank you [08:09:00] but yeah I don't think we're in immediate danger [08:09:36] even though parsoid requests are going up [08:10:11] (03PS1) 10JMeybohm: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) [08:10:12] hah that seems the daily cycle [08:10:19] <_joe_> yes [08:10:45] (03CR) 10CI reject: [V: 04-1] mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [08:11:13] ok yeah we've been hovering around the paging limit for days now, I'm looking at [08:11:16] https://prometheus-eqiad.wikimedia.org/ops/classic/graph?g0.range_input=1w&g0.expr=sum%20by(cluster%2C%20service)%20(phpfpm_statustext_processes%7Bcluster%3D~%22(api_appserver%7Cappserver%7Cparsoid)%22%2Cstate%3D%22idle%22%7D)%20%2F%20sum%20by(cluster%2C%20service)%20(phpfpm_statustext_processes%7Bcluster%3D~%22(api_appserver%7Cappserver%7Cparsoid)%22%7D)%20&g0.tab=0 [08:11:22] page is at <= 0.3 [08:12:57] now I'm wondering if we can handle parsoid traffic today with parse1002 not working and parsoid latencies high ? 
[08:13:52] (03PS2) 10JMeybohm: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) [08:16:43] (03CR) 10JMeybohm: [C: 03+2] mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [08:17:54] (03Merged) 10jenkins-bot: mesh.certificate: Ensure the commonName is at most 64 bytes long [deployment-charts] - 10https://gerrit.wikimedia.org/r/940090 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [08:21:30] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [08:21:39] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [08:22:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:23:09] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-esams [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) [08:23:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:23:45] the page is bound to fire again today I think btw [08:23:56] jbond: ^ FYI [08:24:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[08:24:51] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [08:25:25] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [08:27:34] (HelmReleaseBadStatus) resolved: Helm release wikifunctions/main-evaluator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:27:53] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [08:28:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:28:19] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42598/console" [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [08:28:26] as expected [08:28:43] acked [08:29:01] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [08:29:19] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [08:30:19] ok so current options: tweak the threshold for parsoid only, add capacity to parsoid, disable pre-gen as suggested on T342085 (not mutually exclusive) [08:30:19] T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 [08:30:28] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: 
Add option to disable keepalive on port 80 on A:cp-esams [puppet] - 10https://gerrit.wikimedia.org/r/940091 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [08:31:40] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940091 (T342211) to esams DC (disable keepalive on port 80) [08:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:44] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [08:33:11] hi godog sorry was afk, reading backscroll now [08:33:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:33:18] hi jbond [08:33:59] yeah nothing is on fire immediately btw, I think we should be adding more capacity to parsoid temporarily, and trying to find how to do that [08:34:23] ack [08:36:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-datasource=eqiad%20prometheus/ops&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:36:41] ok I'm silencing the alert for a while [08:36:45] +1 [08:37:33] silenced for 4h [08:37:52] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019664, @JMeybohm wro... 
[08:43:16] (03CR) 10Jelto: [C: 03+2] contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [08:44:47] jelto: funnily I just thought about that Apache change :) [08:45:36] hah :) I'll merge this now and test it, I have httpbb tests ready and integration open in the browser [08:46:40] given the config changes are probably not covered by the httpbb suite [08:51:05] merged and apache reloaded. Still looks good [08:53:15] (03PS1) 10Hashar: httpbb: test Gearman testConnection is forbidden [puppet] - 10https://gerrit.wikimedia.org/r/940092 (https://phabricator.wikimedia.org/T219991) [08:53:20] jelto: ^ :) [08:53:34] that should cover one of the rules [08:55:23] (03CR) 10Gehel: "Minor comment inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [08:56:27] (03CR) 10Jelto: [C: 03+2] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/940092 (https://phabricator.wikimedia.org/T219991) (owner: 10Hashar) [08:59:09] httpbb tests look good, all pass (and one test more in total) :) [08:59:20] awesome thank you! [09:01:02] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) >>! In T341992#9029300, @RobH wrote: > Cool, I understand now. I'll move and update netbox/homer for these two hosts tomorrow to move them to 10G configured ports 44/45 I renumbered... 
[09:03:16] (03PS1) 10Filippo Giunchedi: conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) [09:03:56] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: auto link existing users with OIDC [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:04:06] (03CR) 10Elukey: [C: 03+1] conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi) [09:09:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You need at least to set the "cluster" hiera variable to "parsoid"." [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi) [09:12:01] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:13:17] (03PS2) 10Filippo Giunchedi: conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) [09:15:38] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/940095/42599/" [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi) [09:15:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42600/console" [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi) [09:15:50] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. 
- cmooney@cumin1001" [09:17:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001" [09:17:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:17:44] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:19:40] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=mw1356.eqiad.wmnet [09:19:48] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=mw1357.eqiad.wmnet [09:21:18] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: temp add more capacity to parsoid eqiad [puppet] - 10https://gerrit.wikimedia.org/r/940095 (https://phabricator.wikimedia.org/T342085) (owner: 10Filippo Giunchedi) [09:21:55] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) [09:22:34] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) p:05Medium→03High [09:22:50] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) I deployed the change above which adds this two lines to the `gitlab.rb` file: ` gitlab_rails['omniauth_auto_link_user'] =... 
[09:24:08] (03PS12) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [09:24:10] (03PS6) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [09:24:12] (03PS15) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [09:24:14] (03PS7) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [09:24:15] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001" [09:25:08] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=mw1356.eqiad.wmnet [09:25:14] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw1356.eqiad.wmnet [09:25:56] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=mw1357.eqiad.wmnet [09:26:01] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw1357.eqiad.wmnet [09:27:59] (03CR) 10Ilias Sarantopoulos: "This change is ready for review." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [09:28:09] (03CR) 10Jbond: [C: 03+2] sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [09:29:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Jelto) [09:30:32] (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [09:35:25] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42601/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:36:39] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42603/console" [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:36:41] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42602/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [09:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:48:30] (03PS4) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [09:50:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove reverse dns for IP allocated in error. - cmooney@cumin1001" [09:50:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:40] (03PS5) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [09:56:19] (03PS1) 10Jelto: Revert "gitlab: move gitlab to test idp" [puppet] - 10https://gerrit.wikimedia.org/r/939345 [10:00:04] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1000) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1000) [10:01:25] (03CR) 10Jbond: "thanks response inline" [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:10:50] (03PS1) 10Btullis: Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) [10:15:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42604/console" [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [10:15:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:20:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:23:43] (03CR) 10Klausman: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman) [10:25:56] (03CR) 10Klausman: [C: 03+2] ml-services: Bump revertrisk-la to latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman) [10:26:36] (03Merged) 10jenkins-bot: ml-services: Bump revertrisk-la to latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/940099 (https://phabricator.wikimedia.org/T340822) (owner: 10Klausman) [10:29:24] (03PS7) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [10:29:29] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [10:33:06] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:40:44] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:41:36] (03PS59) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [10:45:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:49:52] (03CR) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [10:50:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:53:59] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:55:03] 10SRE-swift-storage: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon) [11:00:47] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42605/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [11:01:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [11:01:55] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) [11:03:17] (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) [11:05:17] (03PS2) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) [11:06:48] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42606/console" [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [11:07:54] (03CR) 10CI reject: [V: 04-1] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:08:44] (03PS1) 10MVernon: Get rid of swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) [11:10:35] (03PS1) 10MVernon: Remove swiftrepl [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334) [11:11:46] (03CR) 10MVernon: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:14:11] 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe) [11:14:49] (03PS3) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) [11:15:06] 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe) [11:16:10] 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10Mpossoupe) [11:16:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42608/console" [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:16:54] (03CR) 10Vgutierrez: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [11:17:20] (03CR) 10Jbond: [C: 03+1] "not too familiar with Swift::Swiftrepl but it seems unused and the CR seems like a noop to me" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:17:22] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42609/console" [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:17:46] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) @andrea.denisse, the new request is here: T342335 Thanks [11:18:19] (03CR) 10Jbond: [C: 03+1] "lgtm (although seems strange to not 
include this in 940103)" [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:18:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:18:56] (03CR) 10Stevemunene: [C: 03+1] Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:19:35] (03PS1) 10Btullis: Fix the device name when running parted on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151) [11:20:06] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/939283 [11:21:00] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42610/console" [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [11:21:55] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the device name when running parted on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940109 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [11:22:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:28:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:29:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:34:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:35:27] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940101 (T342211) to eqsin DC (disable keepalive on port 80 on A:cp-eqsin) [11:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:31] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [11:35:40] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/940101 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [11:39:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:44:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:49:08] (03CR) 10MVernon: [C: 03+2] Get rid of swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/940103 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:49:35] (03CR) 10MVernon: [C: 03+2] Remove swiftrepl [software] - 10https://gerrit.wikimedia.org/r/940105 (https://phabricator.wikimedia.org/T342334) (owner: 10MVernon) [11:50:03] (03CR) 10Joal: "Comments about possibly updating default settings" [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:50:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:50:57] (03PS1) 10Dreamy Jazz: SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) [11:51:04] 10SRE-swift-storage, 10Patch-For-Review: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon) [11:52:08] 10SRE-swift-storage, 10Patch-For-Review: Remove / tidy up old swiftrepl code - https://phabricator.wikimedia.org/T342334 (10MatthewVernon) 05Open→03Resolved [11:52:29] jouncebot: nowandnext [11:52:29] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [11:52:29] In 1 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1300) [11:52:29] In 1 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1300) [11:52:52] zabe: are you around to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/940126/ please? or should i go ahead with that? [11:52:54] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch loopbacks. 
- cmooney@cumin1001" [11:53:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch loopbacks. - cmooney@cumin1001" [11:53:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:53:37] (03CR) 10Urbanecm: [C: 03+2] SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz) [11:53:42] i guess i can +2 it anyway. [11:55:51] (03PS1) 10Btullis: Do not attempt to use hdparm on nvme drives for cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151) [11:57:29] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42611/console" [puppet] - 10https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [11:58:10] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) 05Open→03In progress p:05Triage→03Medium [11:58:15] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [11:58:52] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) a:05Clement_Goubert→03Joe [11:59:41] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) [12:01:46] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:30] urbanecm: I can do it I guess [12:06:31] zabe: feel free to finish it; i can deploy if needed, but I'd prefer someone else doing it. Thank you! [12:07:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz) [12:08:42] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) links moved, servers online for remote os installation. [12:11:18] (03CR) 10Jelto: [C: 03+1] "lgtm, but I'm not 100% sure about the service catalog change" [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [12:11:32] (03PS2) 10Btullis: Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) [12:11:46] (03Merged) 10jenkins-bot: SpecialUserRights: Check for username to be temporary [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940126 (https://phabricator.wikimedia.org/T340468) (owner: 10Dreamy Jazz) [12:11:48] (03CR) 10Joal: [C: 03+1] "LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:12:13] !log zabe@deploy1002 Started scap: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]] [12:12:15] (03CR) 10Btullis: Enable local caching for presto on the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:12:19] T340468: Throw an error from UserGroupManager::addUserToGroup if called on a temporary user - https://phabricator.wikimedia.org/T340468 [12:12:19] T342322: Unable to use interwiki Special:UserRights - https://phabricator.wikimedia.org/T342322 [12:12:34] (03CR) 10Btullis: [C: 03+2] Enable local caching for presto on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/940097 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:13:38] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:13:49] !log zabe@deploy1002 zabe and dreamyjazz: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:15:50] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch gw ips. 
- cmooney@cumin1001" [12:17:06] (03PS3) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [12:19:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:19:38] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:21] 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) All cp hosts in esams and eqsin have keep-alive disabled on port 80. Drop in number of sessions on port 80: {F37144516} {F37144519} The number of (correctly redirected) requests managed by those h... [12:20:35] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:940126|SpecialUserRights: Check for username to be temporary (T340468 T342322)]] (duration: 08m 22s) [12:20:40] T340468: Throw an error from UserGroupManager::addUserToGroup if called on a temporary user - https://phabricator.wikimedia.org/T340468 [12:20:40] T342322: Unable to use interwiki Special:UserRights - https://phabricator.wikimedia.org/T342322 [12:22:07] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for assigned switch gw ips. 
- cmooney@cumin1001" [12:22:07] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:23:12] (03CR) 10Cathal Mooney: [C: 03+2] admin: add Ifrah Khanyaree (WMDE) to LDAP-only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney) [12:24:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:52] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [12:25:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... [12:25:49] (03PS2) 10Jelto: gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) [12:26:08] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [12:27:59] (03CR) 10Jelto: [C: 03+2] gitlab_runner: disable unprivileged_userns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto) [12:30:26] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim Uhmmm. I am sorry for the hassle. [The Phabricator account now shows no authentication factors](https://phabricator.wikimedia.org/p/Cyndymediawiksim/) so it will no... 
[12:31:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:33:01] (03PS1) 10Cathal Mooney: Change username to match ldap one [puppet] - 10https://gerrit.wikimedia.org/r/940118 (https://phabricator.wikimedia.org/T341455) [12:36:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:39:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:41:04] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:17] thanks again for the deploy zabe. it seems to work now. 
[12:41:24] (03CR) 10Cathal Mooney: [C: 03+2] Change username to match ldap one [puppet] - 10https://gerrit.wikimedia.org/r/940118 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney) [12:42:17] (03PS3) 10Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) [12:42:22] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: 10Ladsgroup) [12:43:16] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [12:43:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... [12:43:47] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [12:43:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... 
[12:44:34] !log LDAP - adding user ifrahkh to groups wmde & nda [12:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:45:29] (03PS8) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [12:46:03] (03CR) 10Urbanecm: [C: 03+2] IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [12:46:04] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:53] (03PS9) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [12:46:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10cmooney) @Ifrahkhanyaree you've now been added to the required LDAP groups (username ifrahkh). Please try to access the systems you need and advise if there... 
[12:47:02] (03CR) 10Urbanecm: [C: 03+2] IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [12:47:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:47:29] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) A basic check of the orchestra... [12:47:43] (03Merged) 10jenkins-bot: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [12:53:02] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [12:54:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:56:42] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [12:56:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[12:59:53] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @btullis replaced cable on analytics1073 & analytics1075 [13:00:09] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [13:00:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... [13:04:31] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:06:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:07:14] 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) [13:08:49] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) That error is from our top-level, las... 
[13:09:16] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001" [13:09:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001" [13:10:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:11:49] (03PS6) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [13:12:42] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [13:12:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[13:13:26] (03PS1) 10Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) [13:13:54] 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) p:05Triage→03Medium [13:14:04] (03CR) 10Vgutierrez: Remove references to releases1002/releases2002 for decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [13:14:19] (03CR) 10CI reject: [V: 04-1] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [13:16:45] (03PS4) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [13:16:47] (03PS2) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) [13:17:10] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [13:17:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... 
[13:18:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:18:16] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:18:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:21:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:21:42] (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [13:21:57] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001" [13:22:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for vlan ints lsw1-f8-eqiad - cmooney@cumin1001" [13:22:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:30] (03PS2) 10Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) [13:24:36] (03PS1) 10JMeybohm: wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) [13:25:34] (03CR) 10JMeybohm: "Feel free to deploy anytime" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [13:31:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 
'sync'. [13:32:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:34:30] (03CR) 10Ayounsi: [C: 03+1] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [13:35:25] (03CR) 10Cathal Mooney: [C: 03+2] Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [13:35:30] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) [13:37:33] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) [13:37:53] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42612/console" [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [13:39:43] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) 05Declined→03Open To upgrade the OS on CI s... 
[13:43:42] (03PS1) 10JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) [13:44:06] (03CR) 10CI reject: [V: 04-1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [13:44:36] (03CR) 10Jelto: [C: 03+1] "lgtm (this time with actually clicking on +1)" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [13:44:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [13:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:46:37] (03PS2) 10Btullis: Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) [13:47:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:48:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:49:13] (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) (owner: 10Btullis) [13:49:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:52:09] (03CR) 10Jelto: [V: 03+1 C: 03+1] Run LDAP group sync periodically on gitlab replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [13:52:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:53:08] (03CR) 10Eevans: [C: 03+2] cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [13:53:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:54:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:11] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [13:54:13] !log bking@cumin1001 START - Cookbook sre.dns.netbox [13:55:46] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:55:49] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [13:56:02] (03PS2) 10JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) [13:57:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [13:57:22] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [13:57:29] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [13:58:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:58:25] (03CR) 10Jelto: [V: 03+1 C: 03+2] Run LDAP group sync periodically on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [13:58:50] (03CR) 10CI reject: [V: 04-1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [13:59:19] 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) > typing go completes the installation correctly correction typing go and then doing a reboot via install_console allows things to complete [13:59:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:59:53] (03PS7) 10Jelto: Run LDAP group sync periodically on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [14:01:26] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [14:01:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[14:03:45] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42613/console" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [14:04:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:04] (03PS3) 10Cathal Mooney: Add reverse DNS for per-rack subnets on new lsw devices [dns] - 10https://gerrit.wikimedia.org/r/940124 (https://phabricator.wikimedia.org/T334230) [14:04:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:04:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:05:19] (03PS3) 10JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:01] (03PS5) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [14:12:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:13:28] !log dns1004 upgrade to pdns-rec 4.8.4: T341611 [14:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:31] T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 [14:13:33] (03Abandoned) 10Jforrester: [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/937972 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [14:13:58] (03CR) 10Jforrester: [C: 03+1] k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:14:59] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [14:15:10] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... 
[14:15:35] (03PS1) 10Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 [14:15:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:12] (03Abandoned) 10Ssingh: Revert "depool esams: router migration" [dns] - 10https://gerrit.wikimedia.org/r/938678 (owner: 10Ssingh) [14:17:37] (03PS2) 10Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) [14:17:42] (03CR) 10Hashar: "That is probably not a high priority but that makes it easier to review differences when refreshing dependencies of an otherwise untouched" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [14:18:44] (03CR) 10Btullis: [V: 03+1 C: 03+2] Do not attempt to use hdparm on nvme drives for cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940116 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [14:21:11] (03PS1) 10Bking: netboot.cfg: explicitly define partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705) [14:21:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] dnsrecursor: allow configuring the webserver loglevel [puppet] - 10https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: 10Ssingh) [14:22:04] (03PS3) 10Hashar: python-build: set date of source files in the wheel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 
(https://phabricator.wikimedia.org/T342346) [14:22:45] (03CR) 10Jelto: [V: 03+1 C: 03+2] Run LDAP group sync periodically on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [14:25:35] 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) I have rebuild sretest1002 again and things now work. I have build things a few times (logs in cumin1001:~cookbook-testing/logs) and so far it seems a bit random when... [14:25:53] (03PS1) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) [14:26:48] (03CR) 10Ssingh: [C: 03+1] "I am sure you know but you can run agent on A:installserver to make sure that the changes are picked up immediately!" [puppet] - 10https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [14:27:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:27:35] (03CR) 10Bking: [C: 03+2] netboot.cfg: explicitly define partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/940160 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [14:29:59] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: allow configuring the webserver loglevel [puppet] - 10https://gerrit.wikimedia.org/r/937991 (https://phabricator.wikimedia.org/T341611) (owner: 10Ssingh) [14:30:14] (03PS4) 10EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 
(https://phabricator.wikimedia.org/T334435) [14:30:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:30:41] !log disable puppet on A:dns-rec to slowly roll out CR 937991 [14:30:41] (03PS4) 10JMeybohm: k8s::apparmor: Add support for deploying apparmor profiles to k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:07] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1075.eqiad.wmnet with reason: host reimage [14:32:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:32:43] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host flink-zk1003.eqiad.wmnet with OS bookworm [14:32:49] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w... 
[14:33:06] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:18] ^ expected [14:33:30] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:53] (03PS1) 10Jelto: gitlab: make sure ldap_group_sync_user is created first [puppet] - 10https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211) [14:34:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1075.eqiad.wmnet with reason: host reimage [14:34:32] (03CR) 10Herron: [C: 03+2] service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/939326 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [14:36:13] !log run agent on cumin -b1 -s30 'A:dns-rec and not P{dns4004*}' [14:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:51] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) a:05RobH→03Vgutierrez Ready for installation! 
[14:37:56] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [14:38:03] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [14:38:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm [14:38:11] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w... [14:38:58] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [14:39:04] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [14:39:07] (03CR) 10JHathaway: puppetserver: do not notify puppetserver service on changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:40:31] (03CR) 10Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [14:41:13] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:41:16] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:42:34] (03CR) 10Hashar: "Eventually I went to 
implement it using PIP_FIND_LINKS:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [14:44:18] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:45:35] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts analytics1073.eqiad.wmnet [14:45:41] !log roll restart codfw/eqiad low-traffic pybals to add prometheus-https T326657 [14:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [14:45:48] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:45:54] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts analytics1073.eqiad.wmnet [14:46:05] 10SRE, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10aborrero) a:05Jclark-ctr→03aborrero the DC-ops part is done. [14:46:38] (03Merged) 10jenkins-bot: wikifunctions: Both charts are required to use readOnlyRootFilesystem [deployment-charts] - 10https://gerrit.wikimedia.org/r/940147 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:47:22] (03CR) 10Cory Massaro: "Thank you! I'd like to run a quick test with these AppArmor profiles, one moment." [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:47:41] (03CR) 10Hashar: "That one should be straightforward. zuul-gearman.py is a script I wrote ages ago to inspect the Gearman server. 
I have found out there i" [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) (owner: 10Hashar) [14:48:48] (03CR) 10Jbond: puppetserver: do not notify puppetserver service on changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:50:08] (03CR) 10Vgutierrez: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:50:22] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1073.eqiad.wmnet with OS bullseye [14:50:29] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [14:51:09] (03CR) 10Ssingh: [C: 03+1] "see vg's nit but looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:51:13] (03CR) 10Cory Massaro: [C: 03+1] "These profiles work in my testing, so I'm happy to approve!" 
[puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:51:17] (03PS1) 10Cwhite: logstash: move labels.trace to error.stack_trace [puppet] - 10https://gerrit.wikimedia.org/r/939285 (https://phabricator.wikimedia.org/T339137) [14:51:46] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1003.eqiad.wmnet with reason: host reimage [14:52:13] (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) [14:54:28] (03CR) 10Ssingh: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:54:30] (03CR) 10EoghanGaffney: [C: 03+2] Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:55:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1003.eqiad.wmnet with reason: host reimage [14:56:13] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:56:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) Reopening ticket with dell [14:56:46] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:57:13] (03CR) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:57:30] (03PS1) 10Jelto: gitlab: add ldap sync token [labs/private] - 10https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211) 
[14:57:33] (03CR) 10Fabfur: [C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/940150 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:58:32] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab: add ldap sync token [labs/private] - 10https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [14:58:36] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940150 (T342211) to ulsfo DC (disable keepalive on port 80 on A:cp-ulsfo) [14:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:39] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [14:58:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1075.eqiad.wmnet with OS bullseye [14:59:10] 10SRE, 10Traffic, 10Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) [14:59:52] (03CR) 10Jelto: [V: 03+2 C: 03+2] gitlab: add ldap sync token [labs/private] - 10https://gerrit.wikimedia.org/r/940176 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [15:02:23] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) 05Open→03Resolved Many thanks to all concerned. These hosts now have regained connectivity and have been upgraded to 10 G... 
[15:02:53] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab: make sure ldap_group_sync_user is created first (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [15:02:58] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:47] (03PS1) 10Btullis: Stop repeatedly disabling the write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940178 (https://phabricator.wikimedia.org/T330151) [15:04:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:05:03] (03CR) 10Btullis: [C: 03+2] Stop repeatedly disabling the write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940178 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [15:05:10] (03PS1) 10Jelto: gitlab: move gitlab::ldap_group_sync_bot_token to private puppet [puppet] - 10https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211) [15:05:12] (03PS6) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [15:05:23] 10SRE, 10SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10andrea.denisse) a:03andrea.denisse [15:05:38] (03PS1) 10Bking: wdqs: fix missing entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314) [15:05:48] (03PS1) 10Cathal Mooney: Add static yaml data for new eqiad leaf devices [homer/public] - 10https://gerrit.wikimedia.org/r/940181 
(https://phabricator.wikimedia.org/T334230) [15:06:47] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bullseye [15:06:50] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42615/console" [puppet] - 10https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [15:06:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... [15:07:21] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [15:07:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... 
[15:07:52] (03CR) 10Jelto: [C: 03+2] gitlab: make sure ldap_group_sync_user is created first [puppet] - 10https://gerrit.wikimedia.org/r/940162 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [15:08:06] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage [15:08:19] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: move gitlab::ldap_group_sync_bot_token to private puppet [puppet] - 10https://gerrit.wikimedia.org/r/940179 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [15:11:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage [15:12:25] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:51] (03CR) 10DCausse: [C: 03+1] wdqs: fix missing entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314) (owner: 10Bking) [15:13:14] (03CR) 10Bking: [C: 03+2] wdqs: fix missing entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/940180 (https://phabricator.wikimedia.org/T332314) (owner: 10Bking) [15:14:51] 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) [15:15:49] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:10] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1003.eqiad.wmnet with OS bookworm [15:16:16] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm completed:... [15:17:56] (03CR) 10JMeybohm: [C: 03+1] mw-api-int: bump replicas to 8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [15:18:00] (03CR) 10JMeybohm: [C: 03+1] mw-api-int: increase namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [15:18:15] (03PS1) 10Giuseppe Lavagetto: kubernetes: add mw-misc "service" [puppet] - 10https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859) [15:19:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:20:24] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [15:20:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[15:28:13] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:19] ^^ rsync: failed to connect to releases1003.eqiad.wmnet: Connection timed out (110) [15:30:28] it runs every 10 minutes so I guess that will self recover [15:30:38] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:15] !log stop kafka main eqiad maintenance - T341558 [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:20] T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [15:31:46] (03PS1) 10Giuseppe Lavagetto: mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) [15:34:31] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) [15:36:40] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42616/console" [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [15:45:00] (03PS7) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [15:46:21] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [15:46:27] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:46:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new 
puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... [15:46:40] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:48:04] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [15:48:09] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:48:50] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:49:26] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [15:50:00] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [15:50:01] (03CR) 10Ayounsi: [C: 03+1] Add static yaml data for new eqiad leaf devices [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [15:51:00] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:51:01] (03PS1) 10Btullis: Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151) [15:51:28] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:51:59] (03PS2) 10Btullis: Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151) [15:54:06] 10SRE-tools, 10Infrastructure-Foundations: Cumin fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) i wonder iff this is because im running this over and over again in quick succession and possibly fingerprint checking is getting in the way as i get the following from... 
[15:59:59] PROBLEM - Host elastic2086 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:17] RECOVERY - Host elastic2086 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [16:00:48] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol1005: add role with new domain [puppet] - 10https://gerrit.wikimedia.org/r/940194 (https://phabricator.wikimedia.org/T341495) [16:00:57] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:14] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [16:02:17] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [16:03:32] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005 [16:03:57] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcontrol1005 [16:04:24] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:12] (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [16:07:14] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:16] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:11] (03CR) 10Btullis: [C: 03+2] Run the ceph osd execs with the shell provider [puppet] - 10https://gerrit.wikimedia.org/r/940192 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:13:01] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye [16:13:29] (ProbeDown) firing: Service vrts2001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts2001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:54] PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [16:15:46] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:02] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage [16:18:16] !log aborrero@cumin1001 START - 
Cookbook sre.dns.netbox [16:18:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: host reimage [16:20:20] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005 - aborrero@cumin1001" [16:21:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005 - aborrero@cumin1001" [16:21:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye [16:21:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[16:21:50] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005 [16:22:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005 [16:25:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol1005: add role with new domain [puppet] - 10https://gerrit.wikimedia.org/r/940194 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [16:27:14] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:36] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:29] (03PS1) 10Cory Massaro: Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 [16:30:02] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:08] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:13] (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudcontrol1005: load cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/940197 (https://phabricator.wikimedia.org/T341495) [16:30:20] (03PS2) 10Cory Massaro: Redeploy with new version of function-ochestrator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 [16:31:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] eqiad1: cloudcontrol1005: load cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/940197 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [16:31:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye [16:34:32] (03PS1) 10Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) [16:34:34] (03PS1) 10Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/940199 [16:36:10] (03PS8) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [16:36:12] (03PS3) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) [16:37:29] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1005.eqiad.wmnet with OS bullseye [16:37:42] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye [16:38:22] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [16:38:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... 
[16:40:55] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [16:41:01] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1013.eqiad.wmnet with OS bookworm [16:42:31] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1001.eqiad.wmnet with OS bookworm [16:42:57] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm [16:43:01] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bookworm [16:44:09] 10SRE, 10SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10nskaggs) Taavi is a valued contributor to Wikimedia and it's projects for over 4 years, and is the current Tech contributor of the year for Wikimedia. He possesses both the requisite skills and knowl... 
[16:46:49] (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [16:47:02] (03CR) 10Ssingh: [C: 03+1] haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [16:47:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1073.eqiad.wmnet with OS bullseye [16:48:02] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: Add option to disable keepalive on port 80 on A:cp-codfw [puppet] - 10https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [16:48:58] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940190 (T342211) to codfw DC (disable keepalive on port 80 on A:cp-codfw) [16:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:02] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [16:49:32] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts analytics1075.eqiad.wmnet [16:49:48] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts analytics1075.eqiad.wmnet [16:49:59] 10SRE, 10Traffic, 10Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) [16:51:48] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [16:52:15] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [16:52:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - 
https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... [16:53:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [16:54:47] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10BTullis) I don't know if you want a new ticket for this, but I'm seeing errors from the `upgrade-firmware` cookbook when run against some hosts with the old... [16:55:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [16:56:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1001.eqiad.wmnet with reason: host reimage [16:57:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1013.eqiad.wmnet with reason: host reimage [16:59:51] (03Abandoned) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:00:06] bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1700) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1700) [17:00:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1001.eqiad.wmnet with reason: host reimage [17:03:17] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/940201 [17:03:28] (03Restored) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:05:10] (03PS1) 10Dwisehaupt: Remove frav1002 monitoring, add it for frav1003 [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) [17:07:22] (03PS9) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [17:07:33] (03CR) 10Jgreen: [C: 03+2] Remove frav1002 monitoring, add it for frav1003 [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt) [17:08:34] (03CR) 10Jgreen: [C: 03+1] "Looks good to me, ready to merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt) [17:09:02] (03PS4) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) [17:09:50] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [17:09:55] * bd808 will not be trying to deploy anything today [17:10:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... [17:12:15] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [17:18:04] PROBLEM - Zookeeper Server on flink-zk1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [17:20:39] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10Papaul) @BTullis the firmware cookbook will failed if the IDRAC version is too old. You need to get it to a minimum version for the API to work. 
[17:21:03] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [17:21:08] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS bookworm [17:21:50] (03PS1) 10Jbond: ssh::publich_fingrprints: also link .ecdsa file [puppet] - 10https://gerrit.wikimedia.org/r/940227 [17:22:13] (03CR) 10Jbond: [C: 03+2] ssh::publich_fingrprints: also link .ecdsa file [puppet] - 10https://gerrit.wikimedia.org/r/940227 (owner: 10Jbond) [17:22:15] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1001.eqiad.wmnet with OS bookworm [17:22:21] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm completed:... 
[17:22:53] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [17:23:09] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) [17:23:17] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10RobH) [17:25:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [17:25:23] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1002.eqiad.wmnet with OS bookworm [17:25:30] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm [17:28:44] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10RobH) [17:28:52] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10RobH) [17:36:40] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on flink-zk1002.eqiad.wmnet with reason: host reimage [17:39:46] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on flink-zk1002.eqiad.wmnet with reason: host reimage [17:41:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye [17:41:16] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... [17:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:55:52] (03CR) 10Jforrester: "No change to image value?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro) [18:00:04] dancy and dduvall: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T1800). [18:00:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host flink-zk1002.eqiad.wmnet with OS bookworm [18:00:25] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm completed:... [18:05:38] (03CR) 10Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [18:08:06] 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) `pdns-recursor 4.8.4-1+wmf11u1` has been running in production on the following hosts for a while: dns1004, 2004, 4003, 5003 doh6001 No issues observed, so we will rolling out to all hosts that use it on... [18:08:26] (03CR) 10Herron: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [18:11:21] 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) [18:11:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Papaul finished double checking that I got everything like we discussed. All the firmware is up to date and the NIC issues have been solved. Can you pleas... [18:15:08] The train is blocked on https://phabricator.wikimedia.org/T342282. Please help to unblock if you can! [18:37:13] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246) [18:37:15] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:37:15] Unblocked! 
[18:37:55] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940238 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:38:02] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:38:05] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [18:38:11] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:41:06] !log bking@cumin1001 conftool action : set/pooled=yes,set/weight=10; selector: name=wdqs2013-19.codfw.wmnet [18:42:46] (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [18:43:11] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2013.codfw.wmnet [18:43:11] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2014.codfw.wmnet [18:43:12] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2015.codfw.wmnet [18:43:19] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2016.codfw.wmnet [18:43:28] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2017.codfw.wmnet [18:43:35] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2018.codfw.wmnet [18:43:38] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2019.codfw.wmnet [18:43:52] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs20{20}.codfw.wmnet [18:44:33] !log bking@cumin1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2020.codfw.wmnet [18:44:42] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.18 refs T340246 [18:44:46] T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246 
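The conftool selectors logged above target the wdqs2013 through wdqs2020 hosts one name at a time. As a rough sketch only, assuming the selector value is interpreted as a regular expression matched against the whole hostname, a single pattern could cover the same range. The helper name `select_hosts` and the host list below are made up for illustration; this is not conftool's own code.

```python
import re

# Hypothetical pattern covering wdqs2013..wdqs2020 in codfw.
# "1[3-9]" matches 13 through 19; the "20" alternative adds wdqs2020.
PATTERN = re.compile(r"wdqs20(1[3-9]|20)\.codfw\.wmnet")


def select_hosts(hosts):
    """Return the hostnames that the pattern matches in full."""
    return [h for h in hosts if PATTERN.fullmatch(h)]


# Illustrative host list only: wdqs2012 through wdqs2021.
hosts = [f"wdqs{n}.codfw.wmnet" for n in range(2012, 2022)]
selected = select_hosts(hosts)
```

Checking a pattern against a list of known hostnames like this, before passing it to a tool that changes pooled state, is a cheap way to confirm it selects exactly the intended hosts.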
[18:45:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:46:52] inflatador: fyi in case it makes your life easier next time, it takes a regex, so wdqs20(1[3-9]|20)\.codfw\.wmnet :) [18:49:19] rzl thanks, I made a few feeble attempts using cumin syntax to no avail ;) [18:50:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:51:28] (03PS1) 10Ryan Kemper: wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314) [18:52:26] (03CR) 10Bking: [C: 03+1] wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper) [18:52:30] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: re-enable alerting on now-in-svc hosts [puppet] - 10https://gerrit.wikimedia.org/r/940240 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper) [18:55:20] PROBLEM - glance-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:42] PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:56:26] PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response 
time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:50] PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [19:05:34] ACKNOWLEDGEMENT - cinder-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:05:34] ACKNOWLEDGEMENT - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:05:35] ACKNOWLEDGEMENT - glance-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time Andrew Bogott I believe this host is offline awaiting re-racking and re-imaging. 
T341495 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:32:28] (03PS1) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [19:32:51] (03CR) 10CI reject: [V: 04-1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [19:33:36] (03PS1) 10Jforrester: apache: Add 'view_urls' rewrite for /view URLs, enable on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190) [19:33:38] (03PS1) 10Jforrester: apache: Enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) [19:34:06] (03PS2) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [19:35:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [19:43:07] (03PS3) 10Cory Massaro: Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 [19:43:33] (03CR) 10Cory Massaro: Redeploy with new version of function-ochestrator. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro) [19:45:20] (03PS2) 10Jforrester: apache: Add 'view_urls' rewrite for /view URLs, enable on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190) [19:45:22] (03PS2) 10Jforrester: apache: Enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) [20:00:05] brennen and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230720T2000). [20:00:21] nothing to deploy [20:17:51] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:25:08] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [20:25:12] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:50:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm that step was already done on june15 see link below. so you should be good to proceed with the OS install. Thanks https://gerrit.wikimedia.org/r/c... [20:53:41] (03CR) 10Jforrester: [C: 03+1] Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro) [21:06:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1156.eqiad.wmnet with OS bullseye [21:07:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye [21:10:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:11:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:13:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:13:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 
bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:16] (03CR) 10Jdlrobson: [C: 03+1] "Jan: can you backport this before you head out?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [21:18:16] !log hashar@deploy1002 Started deploy [integration/docroot@0e476e5]: Tweak Zuul status page css 🥚 [21:18:23] !log hashar@deploy1002 Finished deploy [integration/docroot@0e476e5]: Tweak Zuul status page css 🥚 (duration: 00m 07s) [21:30:47] (03CR) 10Cwhite: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [21:34:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1156.eqiad.wmnet with reason: host reimage [21:38:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1156.eqiad.wmnet with reason: host reimage [21:41:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [21:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:54:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:55:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:55:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1156.eqiad.wmnet with OS bullseye [21:55:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye completed: - an-worker1156 (*... [21:56:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [22:00:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1155.eqiad.wmnet with OS bullseye [22:00:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye [22:07:55] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1155.eqiad.wmnet with reason: host reimage [22:32:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1155.eqiad.wmnet with reason: host reimage [22:47:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:47:56] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:48:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1155.eqiad.wmnet with OS bullseye [22:48:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye completed: - an-worker1155 (*... [22:51:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [22:54:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1154.eqiad.wmnet with OS bullseye [22:55:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye [23:13:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1154.eqiad.wmnet with reason: host reimage [23:16:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1154.eqiad.wmnet with reason: host reimage [23:23:53] (03CR) 10Cwhite: [C: 03+2] logstash: remove grafana log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:31:39] Hi folks, it looks like a patch to PrivateSettings.php got lost or something, and is now causing production errors: https://phabricator.wikimedia.org/T342405 I have no idea how this works though. 
[23:32:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:33:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:33:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1154.eqiad.wmnet with OS bullseye [23:33:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye completed: - an-worker1154 (*... [23:35:19] 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10andrea.denisse) 05Open→03In progress a:03andrea.denisse [23:38:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [23:41:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1153.eqiad.wmnet with OS bullseye [23:42:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye [23:44:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:49:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:54:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:59:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1153.eqiad.wmnet with reason: host reimage