[00:21:07] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965574 [00:38:49] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965574 (owner: 10TrainBranchBot) [00:44:39] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:13] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965574 (owner: 10TrainBranchBot) [01:04:55] ^ false positive, "Permission denied (13)" - but I don't have the rights to give it another +2 to trigger Jenkins [02:00:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:16] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:56:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:56:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:59:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10phaultfinder) [03:03:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:19:56] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (10phaultfinder) [03:54:57] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [04:03:49] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [04:39:01] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 147 probes of 714 (alerts on 90) - 
https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:44:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 714 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:50:57] (03CR) 10Stevemunene: [C: 03+1] Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [04:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 7.2891647792885115s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:08:03] (03PS4) 10KartikMistry: Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) [05:17:56] (03PS1) 10Ilias Sarantopoulos: ml-services: enable multiprocessing for articlequality production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965933 (https://phabricator.wikimedia.org/T348265) [05:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:02] (03PS2) 10Ilias Sarantopoulos: ml-services: enable multiprocessing for articlequality production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965933 (https://phabricator.wikimedia.org/T348265) [05:31:27] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [05:31:47] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [05:32:06] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [05:32:47] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [05:32:54] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [05:32:58] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [05:33:25] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [05:33:35] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [05:33:43] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [05:34:16] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[05:35:19] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [05:36:07] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [05:38:47] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [05:39:42] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [05:40:27] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [05:40:56] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [05:41:14] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [05:41:29] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[05:55:43] (03CR) 10Elukey: [C: 03+1] ml-services: enable multiprocessing for articlequality production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965933 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [05:58:28] (03PS3) 10Elukey: role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) [06:24:13] (03PS1) 10Ilias Sarantopoulos: httpbb(ml-services): add test for langid model [puppet] - 10https://gerrit.wikimedia.org/r/965936 (https://phabricator.wikimedia.org/T340507) [06:26:09] (03CR) 10Elukey: httpbb(ml-services): add test for langid model (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965936 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [06:26:35] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:59] (03PS2) 10Ilias Sarantopoulos: httpbb: add test for langid model [puppet] - 10https://gerrit.wikimedia.org/r/965936 (https://phabricator.wikimedia.org/T340507) [06:29:09] (03CR) 10Ilias Sarantopoulos: httpbb: add test for langid model (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965936 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [06:31:11] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:51] (03CR) 10Elukey: [C: 03+2] httpbb: add test for langid model [puppet] - 10https://gerrit.wikimedia.org/r/965936 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [06:34:43] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: enable multiprocessing for articlequality production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965933 
(https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [06:35:44] (03Merged) 10jenkins-bot: ml-services: enable multiprocessing for articlequality production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965933 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [06:38:51] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10elukey) Eventstreams has been ported to nodejs18, the last LTS. I am working on doing the same for Change Propagation (with the colla... [06:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.2579252095460145s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:39:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.092743701253342s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:42:16] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:44:31] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 3.5468473975484724s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:51:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52964 and previous config saved to /var/cache/conftool/dbconfig/20231016-065134-arnaudb.json [06:51:39] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [06:53:40] (03CR) 10Jelto: [C: 03+2] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965755 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [06:54:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:54:48] (03Merged) 10jenkins-bot: miscweb: update research-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/965755 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [07:00:05] Amir1, Urbanecm, and taavi: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:03:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P52965 and previous config saved to /var/cache/conftool/dbconfig/20231016-070640-arnaudb.json [07:12:41] !log aqu@deploy2002 Started deploy [analytics/refinery@1baf3be] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1baf3be2] [07:15:33] !log aqu@deploy2002 Finished deploy [analytics/refinery@1baf3be] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1baf3be2] (duration: 02m 51s) [07:17:46] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync [07:17:59] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [07:21:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P52966 and previous config saved to /var/cache/conftool/dbconfig/20231016-072147-arnaudb.json [07:27:07] good morning [07:27:36] there were no patches scheduled for this morning backport window [07:28:03] I am doing https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/965220/ as a follow up to last week train [07:28:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965220 (https://phabricator.wikimedia.org/T348689) (owner: 10Jforrester) [07:36:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T343198)', diff saved to https://phabricator.wikimedia.org/P52967 and previous config saved to 
/var/cache/conftool/dbconfig/20231016-073653-arnaudb.json [07:36:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [07:36:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [07:37:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [07:37:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:37:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:37:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T343198)', diff saved to https://phabricator.wikimedia.org/P52968 and previous config saved to /var/cache/conftool/dbconfig/20231016-073731-arnaudb.json [07:41:23] (03PS2) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from bootstrap hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/965164 (https://phabricator.wikimedia.org/T336044) [07:43:00] (03Merged) 10jenkins-bot: Don't try to lock to serialize m3u8 file writes [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965220 (https://phabricator.wikimedia.org/T348689) (owner: 10Jforrester) [07:43:20] !log hashar@deploy2002 Started scap: Backport for [[gerrit:965220|Don't try to lock to serialize m3u8 file writes (T348689 T348667 T348375 T348753)]] [07:43:28] T348689: WebVideoTranscode::updateStreamingManifests: Cannot flush pre-lock snapshot - https://phabricator.wikimedia.org/T348689 [07:43:29] T348375: Commons removal of last remaining caption: Caught exception of type Wikimedia\Rdbms\DBUnexpectedError - https://phabricator.wikimedia.org/T348375 [07:43:29] T348753: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist 
LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 [07:43:29] T348667: DBUnexpectedError while deleting or moving file - https://phabricator.wikimedia.org/T348667 [07:53:18] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10ayounsi) Maybe prepending the AS on the backup LVS is easier to do than expected? i though PyBal's development had stopped, but seeing @Vgutierrez 's [... [07:53:47] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [07:54:15] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [07:54:37] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [07:55:05] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [07:57:35] !log hashar@deploy2002 jforrester and hashar: Backport for [[gerrit:965220|Don't try to lock to serialize m3u8 file writes (T348689 T348667 T348375 T348753)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:57:48] T348689: WebVideoTranscode::updateStreamingManifests: Cannot flush pre-lock snapshot - https://phabricator.wikimedia.org/T348689 [07:57:48] T348375: Commons removal of last remaining caption: Caught exception of type Wikimedia\Rdbms\DBUnexpectedError - https://phabricator.wikimedia.org/T348375 [07:57:49] T348753: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 [07:57:49] T348667: DBUnexpectedError while deleting or moving file - https://phabricator.wikimedia.org/T348667 [07:58:46] !log hashar@deploy2002 jforrester and hashar: Continuing with sync [08:01:59] (03CR) 10Ayounsi: "Thanks for the patch, I don't have enough Pybal dev knowledge but setting myself as CC to follow along." 
[debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [08:05:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:25] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:965220|Don't try to lock to serialize m3u8 file writes (T348689 T348667 T348375 T348753)]] (duration: 27m 04s) [08:10:33] T348689: WebVideoTranscode::updateStreamingManifests: Cannot flush pre-lock snapshot - https://phabricator.wikimedia.org/T348689 [08:10:33] T348375: Commons removal of last remaining caption: Caught exception of type Wikimedia\Rdbms\DBUnexpectedError - https://phabricator.wikimedia.org/T348375 [08:10:34] T348753: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 [08:10:34] T348667: DBUnexpectedError while deleting or moving file - https://phabricator.wikimedia.org/T348667 [08:11:58] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: add redis exporter and prom scrape config [puppet] - 10https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [08:12:40] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/965787 (owner: 10Brouberol) [08:13:58] (03CR) 10Btullis: [C: 03+1] Remove kafka-jumbo100[1-6] brokers from bootstrap hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/965164 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:13:59] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Make it easier to use topicmappr reblance [puppet] - 10https://gerrit.wikimedia.org/r/965787 (owner: 10Brouberol) [08:14:27] (03CR) 10Brouberol: [C: 03+2] Remove kafka-jumbo100[1-6] brokers from bootstrap hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/965164 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [08:15:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:35] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10Vgutierrez) >>! In T348446#9252677, @ayounsi wrote: > Maybe prepending the AS on the backup LVS is easier to do than expected? > i though PyBal's devel... [08:27:11] (03PS1) 10Slyngshede: Implement JavaScript password match check. [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 [08:27:47] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:30:53] (03CR) 10Volans: Implement JavaScript password match check. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 (owner: 10Slyngshede) [08:34:20] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[08:34:32] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [08:34:44] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:34:46] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [08:35:05] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:35:09] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [08:35:26] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [08:35:52] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [08:36:46] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [08:38:31] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [08:38:51] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [08:39:30] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [08:40:01] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [08:40:41] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [08:41:19] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [08:42:54] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [08:43:09] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [08:43:48] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [08:44:18] !log brouberol@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/eventgate-analytics-external: apply [08:44:36] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [08:44:59] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [08:46:47] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1001.eqiad.wmnet [08:48:53] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:58] !log brouberol@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:49:35] (03CR) 10Filippo Giunchedi: "Nice! Tested in Pontoon and LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965785 (https://phabricator.wikimedia.org/T321579) (owner: 10Herron) [08:51:32] !log brouberol@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [08:52:55] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [08:56:24] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [08:59:31] (03PS1) 10JMeybohm: Rebuild images to pick up the fix for CVE-2023-4911 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/966129 (https://phabricator.wikimedia.org/T348647) [09:00:32] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-test-master1001.eqiad.wmnet [09:01:38] (03PS2) 10JMeybohm: Rebuild images to pick up the fix for CVE-2023-4911 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/966129 (https://phabricator.wikimedia.org/T348647) [09:02:15] (03PS4) 10Majavah: security: use concat to construct access.conf [puppet] - 10https://gerrit.wikimedia.org/r/965461 [09:02:19] (03PS2) 10Slyngshede: Implement JavaScript password match check. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/966123 [09:02:31] (03CR) 10Majavah: security: use concat to construct access.conf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965461 (owner: 10Majavah) [09:02:44] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Rebuild images to pick up the fix for CVE-2023-4911 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/966129 (https://phabricator.wikimedia.org/T348647) (owner: 10JMeybohm) [09:04:51] (03PS3) 10Slyngshede: Implement JavaScript password match check. [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 [09:05:00] (03CR) 10Slyngshede: Implement JavaScript password match check. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/966123 (owner: 10Slyngshede) [09:06:32] 10SRE-swift-storage, 10TimedMediaHandler, 10MW-1.41-notes (1.41.0-wmf.30; 2023-10-10), 10MW-1.42-notes (1.42.0-wmf.1; 2023-10-17), and 2 others: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.w... - https://phabricator.wikimedia.org/T348753 [09:11:03] Hello, friends. I'm not entirely sure why this has just started happening but there appears to be an analytics instrument sending ~17,000 invalid events per minute to EventGate. I'm looking at the instrument now to see if I can see what's wrong. I might just have to disable it and notify the team though [09:11:44] brouberol, elukey ^^^ [09:11:58] I know you're talking in the other channel about something potentially related [09:12:07] phuedx: o/ [09:12:15] I posted a msg in https://phabricator.wikimedia.org/T346106#9252934 [09:12:25] and I am chatting with Olga to see how to fix [09:12:27] yes, see #wikimedia-analytics. 
A redeploy triggered a schema validation issue [09:12:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:50] phuedx: we could revert the change and restart eventgate, in theory [09:13:11] elukey: I'll join in -analytics to avoid fragmenting the conversation [09:15:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:23] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:17:23] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:57] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:23:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:05] 10SRE, 10DBA, 10MW-1.41-notes (1.41.0-wmf.30; 2023-10-10): Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [09:25:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:16] hashar: o/ [09:29:38] do you have a minute? We found a problem with some eventgate validation errors, and we think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/965900 should solve it [09:29:45] see https://phabricator.wikimedia.org/T346106#9252934 [09:30:05] (03PS1) 10JMeybohm: Update rsyslog common image version [puppet] - 10https://gerrit.wikimedia.org/r/966137 (https://phabricator.wikimedia.org/T348647) [09:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:26] I am very rusty in mw deployments but I'd just do scap backport 965900 [09:31:22] jouncebot: next [09:31:22] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1000) [09:32:56] elukey: I can backport it. I'm a little rusty too. To put in perspective how long it's been since I've done a deployment, I just had to verify the fingerprint of the deployment server [09:33:06] lol [09:33:13] ok so scap backport #change should do it [09:33:27] (03PS1) 10Slyngshede: data.yaml: Remove dublicate SSH key. 
[puppet] - 10https://gerrit.wikimedia.org/r/966139 [09:33:37] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:48] going to ping other SRE folks [09:34:07] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:35:11] (03CR) 10Volans: [C: 03+1] "LGTM, please make sure to contact Jeff too to let him know." [puppet] - 10https://gerrit.wikimedia.org/r/966139 (owner: 10Slyngshede) [09:36:57] (03PS1) 10Ladsgroup: Enable pagelinks migration WRITE BOTH on some s2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966140 (https://phabricator.wikimedia.org/T345732) [09:37:36] phuedx: so far no opposition, you can proceed in my opinion [09:37:47] elukey: OK [09:38:53] (03PS1) 10Phuedx: Revert "Introduce Web Accessibility Features and Submodule" [extensions/WikimediaEvents] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965901 [09:38:58] !log phuedx@deploy2002 backport Cancelled [09:39:10] It needs to be cherry picked [09:39:13] (03CR) 10Slyngshede: [C: 03+2] data.yaml: Remove dublicate SSH key. [puppet] - 10https://gerrit.wikimedia.org/r/966139 (owner: 10Slyngshede) [09:39:19] lovely [09:39:20] (03PS1) 10JMeybohm: Bump image versions to pick up the fix for CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966141 (https://phabricator.wikimedia.org/T348647) [09:40:02] OK. 
scap backport 965901 appears to be working [09:40:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965901 (owner: 10Phuedx) [09:40:09] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:40:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:23] let me know once you're done, I have some changes to deploy :D [09:40:43] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:41:27] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:41:33] (03CR) 10JMeybohm: [C: 03+2] Update rsyslog common image version [puppet] - 10https://gerrit.wikimedia.org/r/966137 (https://phabricator.wikimedia.org/T348647) (owner: 10JMeybohm) [09:41:38] Amir1: ACK [09:41:54] thanks! 
[09:41:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:08] (03Merged) 10jenkins-bot: Revert "Introduce Web Accessibility Features and Submodule" [extensions/WikimediaEvents] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965901 (owner: 10Phuedx) [09:42:23] !log phuedx@deploy2002 Started scap: Backport for [[gerrit:965901|Revert "Introduce Web Accessibility Features and Submodule"]] [09:43:26] (03CR) 10JMeybohm: [C: 03+2] Bump image versions to pick up the fix for CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966141 (https://phabricator.wikimedia.org/T348647) (owner: 10JMeybohm) [09:43:45] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:965901|Revert "Introduce Web Accessibility Features and Submodule"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:43:46] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10ayounsi) >>! In T348837#9250190, @jhathaway wrote: >[...] Evidently UDP encapsulation may have performance benefits because routers are tuned to support it [...] At the endpoints,... 
[09:45:00] (03PS1) 10Ilias Sarantopoulos: ml-services: disable mp for inference in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/966142 (https://phabricator.wikimedia.org/T348265) [09:46:07] (03Merged) 10jenkins-bot: Bump image versions to pick up the fix for CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966141 (https://phabricator.wikimedia.org/T348647) (owner: 10JMeybohm) [09:46:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:34] (03CR) 10Elukey: [C: 03+1] ml-services: disable mp for inference in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/966142 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [09:46:57] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: disable mp for inference in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/966142 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [09:47:03] Tested on mwdebug2001. I saw an event logged to the eventlogging_DesktopWebUITracking stream with the appropriate properties. 
It was validated and accepted by EventGate [09:47:10] !log phuedx@deploy2002 phuedx: Continuing with sync [09:47:50] (03Merged) 10jenkins-bot: ml-services: disable mp for inference in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/966142 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [09:51:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:51:43] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:52:07] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:52:27] !log phuedx@deploy2002 Finished scap: Backport for [[gerrit:965901|Revert "Introduce Web Accessibility Features and Submodule"]] (duration: 10m 04s) [09:52:28] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:52:41] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:01] Amir1: Done [09:53:20] (03PS2) 10Ladsgroup: Enable pagelinks migration WRITE BOTH on some more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966140 (https://phabricator.wikimedia.org/T345732) [09:53:23] awesome. Thanks! 
[09:54:13] (03PS3) 10Ladsgroup: Enable pagelinks migration WRITE BOTH on some more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966140 (https://phabricator.wikimedia.org/T345732) [09:55:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:29] (03CR) 10Ladsgroup: [C: 03+2] Enable pagelinks migration WRITE BOTH on some more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966140 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [09:57:13] (03Merged) 10jenkins-bot: Enable pagelinks migration WRITE BOTH on some more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966140 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [09:57:31] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:966140|Enable pagelinks migration WRITE BOTH on some more wikis (T345732)]] [09:57:35] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [09:58:43] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:966140|Enable pagelinks migration WRITE BOTH on some more wikis (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1000) [10:01:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:28] !log ladsgroup@deploy2002 ladsgroup: Continuing 
with sync [10:02:11] (03CR) 10Gmodena: "This change is ready for review." [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [10:04:32] (03PS1) 10Hnowlan: rest-gateway: route API specs for AQS2 services [deployment-charts] - 10https://gerrit.wikimedia.org/r/966148 (https://phabricator.wikimedia.org/T343268) [10:05:32] (03PS1) 10Ladsgroup: Change default of pagelinks to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966149 (https://phabricator.wikimedia.org/T345732) [10:06:51] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:966140|Enable pagelinks migration WRITE BOTH on some more wikis (T345732)]] (duration: 09m 19s) [10:06:58] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:08:48] (03CR) 10Ladsgroup: [C: 03+2] Change default of pagelinks to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966149 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:10:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966149 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:10:06] (03Merged) 10jenkins-bot: Change default of pagelinks to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966149 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:10:22] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:966149|Change default of pagelinks to write both (T345732)]] [10:11:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:43] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:966149|Change default of pagelinks to write both (T345732)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [10:12:57] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:13:06] (03CR) 10Klausman: [C: 03+1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [10:13:28] (03PS1) 10JMeybohm: docker-report: Add exclude action to filter rule [puppet] - 10https://gerrit.wikimedia.org/r/966151 (https://phabricator.wikimedia.org/T348876) [10:14:12] (03CR) 10Jcrespo: [C: 03+1] "Ok to merge, do you want me to do it or you do it? No blockers on my side anyway." [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) (owner: 10FNegri) [10:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:15] (03CR) 10FNegri: Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) (owner: 10FNegri) [10:17:20] (03CR) 10FNegri: [C: 03+2] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) (owner: 10FNegri) [10:17:59] (03CR) 10Elukey: "There is a comment that is unresolved, then we should be good to go!" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [10:18:06] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:966149|Change default of pagelinks to write both (T345732)]] (duration: 07m 44s) [10:18:11] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [10:18:27] (03CR) 10Jcrespo: [C: 03+1] "Very ok to me to merge. 
I would like to know more the context so I make sure to add it in further emails messages beyond the RFC, will thi" [puppet] - 10https://gerrit.wikimedia.org/r/962940 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [10:19:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [10:20:19] (03CR) 10Volans: "some minor comments inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [10:22:09] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [10:22:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:12] (03CR) 10Jcrespo: [C: 03+1] "If you have tested that the effective firewall config is basically the same, ok with it (let's deploy together and slowly + let's test bac" [puppet] - 10https://gerrit.wikimedia.org/r/965656 (owner: 10Muehlenhoff) [10:25:02] (03CR) 10Marostegui: "Nothing to be done on dbctl level right? I don't think so, but just confirming" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [10:25:41] (03CR) 10Jcrespo: [C: 03+1] "Same answer than at https://gerrit.wikimedia.org/r/c/operations/puppet/+/965656" [puppet] - 10https://gerrit.wikimedia.org/r/963745 (owner: 10Muehlenhoff) [10:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:09] (03CR) 10Jcrespo: "A minor nit about spaces." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [10:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:41] (03CR) 10Jcrespo: Switch ES cluster to cluster28 and cluster29 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [10:42:16] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:46:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:10] (03CR) 10Elukey: install_server: create aqs reuse partition reuse recipe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [10:54:17] (03CR) 10Elukey: "Do you have an example of the new text that should be matched?" 
[debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/965788 (owner: 10Eevans) [10:54:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:54:41] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) While I managed to upload the files mentioned on Wed, Oct 11... [10:56:01] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [10:56:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-coord1001.eqiad.wmnet [10:56:59] (03PS2) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) [10:57:03] (03CR) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [10:57:14] (03CR) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [10:59:14] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 
10Unstewarded-production-error, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Yann) This seems to be related to T328872. I get alternatively this error, a... [11:00:30] (03CR) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [11:03:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1001.eqiad.wmnet [11:03:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:48] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:07:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:08:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:15] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:10:19] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:11:28] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:15:14] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:22:31] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:24:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:47] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965461 (owner: 10Majavah) [11:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:41] (03PS1) 10Jbond: admin: 
extend contract for Bai [puppet] - 10https://gerrit.wikimedia.org/r/966167 [11:32:04] (03CR) 10Jbond: [C: 03+2] admin: extend contract for Bai [puppet] - 10https://gerrit.wikimedia.org/r/966167 (owner: 10Jbond) [11:46:14] (03PS1) 10KartikMistry: Update MinT to 2023-10-16-101614-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/966170 (https://phabricator.wikimedia.org/T333969) [11:49:43] PROBLEM - puppet last run on ganeti-test2004 is CRITICAL: CRITICAL: Puppet has been disabled for 604873 seconds, message: ganeti - ayounsi, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:49:56] (03PS4) 10Jbond: puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 [11:51:19] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: remove the call to destroy [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:51:22] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:51:59] PROBLEM - puppet last run on ganeti-test1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605017 seconds, message: ganeti - ayounsi, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:54:16] (03Merged) 10jenkins-bot: sre.hosts.reimage: remove the call to destroy [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:54:18] (03Merged) 10jenkins-bot: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:54:51] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] 
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [11:55:45] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Handle runner pausing exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 (owner: 10EoghanGaffney) [11:58:52] (03PS3) 10Jbond: puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 [11:59:04] (03CR) 10Jbond: "thanks, updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [12:00:51] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [12:02:51] (03PS6) 10Volans: cookbooks: acquire lock for each cookbook run [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) [12:02:53] (03PS2) 10Volans: dhcp: always rewrite the DHCP snippet [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 [12:02:55] (03PS2) 10Volans: dhcp: simplify tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/963757 [12:02:57] (03PS1) 10Volans: tox.ini: remove optimization for tox <4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 [12:02:59] (03PS1) 10Volans: spicerack: add _spicerack_lock private accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) [12:03:01] (03PS1) 10Volans: dhcp: acquire exclusive per-DC lock on write [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 
(https://phabricator.wikimedia.org/T341973) [12:03:21] (03CR) 10CI reject: [V: 04-1] puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [12:03:41] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10cmooney) Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based on source internet IP, which gives us lots o... [12:05:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [12:07:41] (03CR) 10Jbond: [C: 03+2] puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 (owner: 10Jbond) [12:09:53] (03CR) 10CI reject: [V: 04-1] cookbooks: acquire lock for each cookbook run [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:10:23] (03CR) 10CI reject: [V: 04-1] spicerack: add _spicerack_lock private accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:10:39] (03CR) 10Majavah: [C: 03+2] security: use concat to construct access.conf [puppet] - 10https://gerrit.wikimedia.org/r/965461 (owner: 10Majavah) [12:12:04] (03PS7) 10Volans: cookbooks: acquire lock for each cookbook run [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) [12:12:06] (03PS2) 10Volans: spicerack: add _spicerack_lock private accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) [12:12:08] (03PS3) 10Volans: dhcp: always rewrite the DHCP snippet [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 [12:12:10] (03PS3) 10Volans: dhcp: 
simplify tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/963757 [12:12:12] (03PS2) 10Volans: dhcp: acquire exclusive per-DC lock on write [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) [12:13:39] (03PS3) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) [12:13:59] (03CR) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [12:14:48] jouncebot: nowandnext [12:14:49] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [12:14:49] In 0 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1300) [12:14:53] cool [12:15:22] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [12:16:02] (03Merged) 10jenkins-bot: Switch ES cluster to cluster28 and cluster29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [12:16:05] (03PS11) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [12:16:16] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:963720|Switch ES cluster to cluster28 and cluster29 (T342685)]] [12:16:21] T342685: Create cluster28 and cluster29 in existing es4 and es5 hosts - https://phabricator.wikimedia.org/T342685 [12:16:39] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 
(https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [12:17:28] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:963720|Switch ES cluster to cluster28 and cluster29 (T342685)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:17:29] 10SRE, 10Fundraising-Backlog, 10SRE Observability: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10fgiunchedi) FWIW I'm +1 on the general idea of folding `fr-tech-ops` into `fr-tech`. And indeed I believe there should be no need to list individual users again... [12:20:54] (03PS12) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [12:22:41] (03PS1) 10Hashar: wm-checks-api: filter out Zuul start messages [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/966178 (https://phabricator.wikimedia.org/T348920) [12:24:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, and 2 others: update pcc with puppet 7 support - https://phabricator.wikimedia.org/T236373 (10jbond) p:05Low→03Medium [12:29:57] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:32:13] (03Abandoned) 10Hashar: Set conservative retry limits & delays [puppet] - 10https://gerrit.wikimedia.org/r/287148 (https://phabricator.wikimedia.org/T134456) (owner: 10GWicke) [12:34:40] (03Abandoned) 10Hashar: Redirect phabricator.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) (owner: 10Microchip08) [12:35:09] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:963720|Switch ES cluster to cluster28 and cluster29 (T342685)]] (duration: 18m 52s) [12:35:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:35:16] T342685: Create cluster28 and cluster29 in existing es4 and es5 hosts - https://phabricator.wikimedia.org/T342685 [12:37:34] (03PS1) 10Volans: sre.hosts.dhcp: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) [12:37:36] (03PS1) 10Volans: sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) [12:37:38] (03PS1) 10Volans: sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) [12:37:42] (03PS1) 10Volans: sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) [12:37:44] (03PS1) 10Volans: tox.ini: remove optimization for tox <4 [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 [12:38:03] (sorry for the upcoming jenkins spam, CI failures are expected until we release spicerack with the new reference class) [12:40:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:40:28] (03CR) 10Volans: "CI Failures are expected until spicerack will be released" [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:40:44] (03CR) 10CI reject: [V: 04-1] sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:40:50] (03CR) 10CI reject: [V: 04-1] sre.hosts.dhcp: make the lock per-host [cookbooks] 
- 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:41:14] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:41:23] (03CR) 10CI reject: [V: 04-1] sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:42:08] (03Abandoned) 10Hashar: admin: script to rush home directory [puppet] - 10https://gerrit.wikimedia.org/r/456690 (owner: 10Rush) [12:44:24] (03CR) 10CI reject: [V: 04-1] tox.ini: remove optimization for tox <4 [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans) [12:46:10] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (10cmooney) p:05Triage→03Low [12:50:00] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:52:50] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (10cmooney) [12:53:18] (03PS1) 10Cathal Mooney: Change IP for new Eqiad switches to MGMT IP [puppet] - 10https://gerrit.wikimedia.org/r/966195 (https://phabricator.wikimedia.org/T348977) [12:55:04] (just FYI, I won’t be around for the upcoming backport window) [12:56:48] (03PS2) 10Anzx: fix incubatorwiki wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965843 (https://phabricator.wikimedia.org/T348577) [12:59:48] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10ayounsi) 
Regarding MTU. We MUST NOT need to fragment any v4 packet. And MUST reduce the need of IPv6 PMTUD as much as possible. There are 2 main options: 1/ increase the MTU on a... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1300). [13:00:05] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:36] o/ [13:01:18] * TheresNoTime can deploy :D [13:01:42] (one moment) [13:02:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965843 (https://phabricator.wikimedia.org/T348577) (owner: 10Anzx) [13:02:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965899 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [13:03:15] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T348903 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated both ends of power cable on psu2. alert cleared. 
[13:04:08] (03Merged) 10jenkins-bot: fix incubatorwiki wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965843 (https://phabricator.wikimedia.org/T348577) (owner: 10Anzx) [13:04:11] (03Merged) 10jenkins-bot: update throttle rule for UIUC Wikipedia edit-a-thon November 13, 2023 and remove old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965899 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [13:04:26] !log samtar@deploy2002 Started scap: Backport for [[gerrit:965843|fix incubatorwiki wordmark (T348577)]], [[gerrit:965899|update throttle rule for UIUC Wikipedia edit-a-thon November 13, 2023 and remove old throttle rules (T346043)]] [13:04:36] T348577: Incubator uses generic Wikimedia wordmark in Minerva header/footer - https://phabricator.wikimedia.org/T348577 [13:04:37] T346043: Lift IP caps for UIUC Wikipedia edit-a-thon (Oct13, Nov13, 2023) - https://phabricator.wikimedia.org/T346043 [13:04:53] aanzx: did both together fwiw :) [13:04:58] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:05:05] TheresNoTime: ok [13:05:10] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:05:38] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:965843|fix incubatorwiki wordmark (T348577)]], [[gerrit:965899|update throttle rule for UIUC Wikipedia edit-a-thon November 13, 2023 and remove old throttle rules (T346043)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:49] Checking [13:05:53] ack [13:07:11] TheresNoTime: looks good [13:07:20] !log samtar@deploy2002 samtar and anzx: Continuing with sync [13:09:43] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10BBlack) Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the 
LBs and the target hosts) such that the... [13:12:12] 10SRE, 10SRE-Access-Requests: New SSH key for Jeff Green - https://phabricator.wikimedia.org/T348981 (10Jgreen) [13:12:35] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:965843|fix incubatorwiki wordmark (T348577)]], [[gerrit:965899|update throttle rule for UIUC Wikipedia edit-a-thon November 13, 2023 and remove old throttle rules (T346043)]] (duration: 08m 08s) [13:12:37] aanzx: live :) can you double-check the logo is purged & there? [13:12:40] T348577: Incubator uses generic Wikimedia wordmark in Minerva header/footer - https://phabricator.wikimedia.org/T348577 [13:12:41] T346043: Lift IP caps for UIUC Wikipedia edit-a-thon (Oct13, Nov13, 2023) - https://phabricator.wikimedia.org/T346043 [13:12:47] Ok [13:13:50] TheresNoTime: checked logo is purged [13:13:58] Awesome [13:14:01] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:14:12] TheresNoTime: thank you [13:14:16] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:14:20] you're welcome! :-) [13:16:38] (03CR) 10Jforrester: [C: 03+1] Disable DoubleWiki extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544) (owner: 10Ladsgroup) [13:16:41] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10cmooney) > There are 2 main options: > 2/ Decrease the MSS on the realservers (any host where a tunnel can terminate) > In a TCP handshake each side tells its peer what its MSS is,...
[13:16:52] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy new Bullseye version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) [13:17:44] (03PS1) 10JMeybohm: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) [13:18:32] (03CR) 10CI reject: [V: 04-1] Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [13:22:16] (03CR) 10JMeybohm: [C: 03+2] docker-report: Add exclude action to filter rule [puppet] - 10https://gerrit.wikimedia.org/r/966151 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [13:25:29] (03PS2) 10JMeybohm: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) [13:25:54] (03PS2) 10Btullis: Remove the need for the analytics-meta database to require java [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) [13:25:56] (03PS7) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [13:26:50] (03PS1) 10Filippo Giunchedi: graphite: bump uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/966201 (https://phabricator.wikimedia.org/T347221) [13:27:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44071/console" [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [13:28:00] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10cmooney) 
>>! In T348837#9253673, @BBlack wrote: > If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, While you can set MTUs on a route, afa... [13:28:04] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm) [13:28:20] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm) @Eevans I'm on site and can start the move process if you are ready. [13:29:03] (03PS4) 10Jbond: puppet: add support for puppetserver returning nonzero rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 [13:30:09] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:30:23] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:30:32] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:30:44] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:31:02] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: bump uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/966201 (https://phabricator.wikimedia.org/T347221) (owner: 10Filippo Giunchedi) [13:31:45] 10SRE-swift-storage, 10TimedMediaHandler, 10MW-1.41-notes (1.41.0-wmf.30; 2023-10-10), 10MW-1.42-notes (1.42.0-wmf.1; 2023-10-17), and 2 others: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.w... - https://phabricator.wikimedia.org/T348753 [13:32:48] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:32:53] (03CR) 10Herron: [C: 03+2] arclamp: add redis exporter and prom scrape config [puppet] - 10https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [13:32:55] (03PS1) 10Filippo Giunchedi: idp: bump graphite uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/966202 (https://phabricator.wikimedia.org/T347221) [13:33:02] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:33:04] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:33:15] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:33:16] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:33:31] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:33:32] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:33:49] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:33:50] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:34:11] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:34:12] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:34:16] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm) my bad I meant for 10 am, will prep what I can though [13:34:35] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:34:36] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:34:38] !log close UTC afternoon backport window [13:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:54] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:35:43] (03CR) 10Filippo Giunchedi: [C: 03+2] idp: bump graphite uWSGImaxVars [puppet] - 10https://gerrit.wikimedia.org/r/966202 (https://phabricator.wikimedia.org/T347221) (owner: 10Filippo Giunchedi) [13:35:55] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [13:36:06] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [13:36:17] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [13:36:26] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [13:36:31] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [13:36:40] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [13:37:51] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:38:44] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:39:05] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:39:47] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:40:07] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:41:18] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:41:35] (03PS22) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:41:36] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:42:11] (03CR) 10Ilias Sarantopoulos: "Done, sorry I missed that!" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:42:16] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:43:17] (03CR) 10CI reject: [V: 04-1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:44:11] (03PS2) 10Jbond: mcrouter_pools: drop the use of alert [puppet] - 10https://gerrit.wikimedia.org/r/965775 [13:44:13] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Eevans) I'm available now if you wanted to start early [13:44:28] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965775 (owner: 10Jbond) [13:48:20] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:48:48] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:50:49] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965775 (owner: 10Jbond) [13:51:42] (03Abandoned) 10Cathal Mooney: Change IP for new Eqiad switches to MGMT IP [puppet] - 10https://gerrit.wikimedia.org/r/966195 (https://phabricator.wikimedia.org/T348977) (owner: 10Cathal Mooney) [13:52:10] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:52:30] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:52:38] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:52:55] (03CR) 10Volans: [C: 03+1] "LGTM, just missing one test case." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [13:53:05] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:55:14] (03PS23) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:56:58] (03PS1) 10Ssingh: hiera: remove dns4003 for authdns_servers for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/966206 (https://phabricator.wikimedia.org/T342154) [13:58:19] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:29] (03CR) 10Ssingh: [C: 03+2] hiera: remove dns4003 for authdns_servers for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/966206 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [13:58:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:58:53] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) >>! In T348837#9253425, @cmooney wrote: > Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based... 
[13:59:49] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:53] jouncebot: nowandnext [14:00:53] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [14:00:54] In 1 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1530) [14:01:01] (03PS2) 10Ladsgroup: Disable DoubleWiki extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544) [14:01:05] (03CR) 10Ladsgroup: [C: 03+2] Disable DoubleWiki extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544) (owner: 10Ladsgroup) [14:01:56] (03Merged) 10jenkins-bot: Disable DoubleWiki extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544) (owner: 10Ladsgroup) [14:02:13] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:965707|Disable DoubleWiki extension everywhere (T344544)]] [14:02:19] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:24] ^ expected [14:02:29] T344544: Merge, undeploy, and archive the DoubleWiki extension - https://phabricator.wikimedia.org/T344544 [14:02:41] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm) [14:02:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bookworm [14:03:11] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:27] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:965707|Disable DoubleWiki extension everywhere 
(T344544)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:37] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:04:44] (03PS1) 10Jforrester: wikifunctions: Update orchestrator image to latest for logging benefits [deployment-charts] - 10https://gerrit.wikimedia.org/r/966207 [14:04:46] (03PS1) 10Jforrester: wikifunctions: Update evaluators to latest ahead of WASM switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/966208 [14:05:13] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [14:06:15] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:06:15] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:06:31] Amir1: OK for me to do a quick service deploy? [14:06:40] I'll be done soon [14:06:47] Ack. [14:06:50] Will wait then. 
[14:10:06] (03CR) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [14:10:22] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:965707|Disable DoubleWiki extension everywhere (T344544)]] (duration: 08m 09s) [14:10:37] T344544: Merge, undeploy, and archive the DoubleWiki extension - https://phabricator.wikimedia.org/T344544 [14:10:45] James_F: done now [14:10:50] the floor is yours [14:10:52] (03PS1) 10Herron: prometheus: add redis_arclamp to redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/966209 (https://phabricator.wikimedia.org/T348756) [14:10:57] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Update orchestrator image to latest for logging benefits [deployment-charts] - 10https://gerrit.wikimedia.org/r/966207 (owner: 10Jforrester) [14:11:07] PROBLEM - Host 2620:0:863:1:198:35:26:7 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:51] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator image to latest for logging benefits [deployment-charts] - 10https://gerrit.wikimedia.org/r/966207 (owner: 10Jforrester) [14:12:09] ^ this is dns4003, depooled, and downtimed. 
why it's alerting, not sure but I will check [14:12:18] (03CR) 10Eevans: install_server: create aqs reuse partition reuse recipe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [14:12:29] sukhe: that's a "host" definition in icinga with the IP [14:12:36] so not automatically downtimed by any hostname matching [14:13:00] (03PS3) 10Eevans: install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) [14:13:03] yeah I am trying to see where it's coming from though [14:13:35] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:09] modules/dnsrecursor/manifests/monitor.pp [14:14:41] PROBLEM - Recursive DNS on 198.35.26.7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:14:56] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44072/console" [puppet] - 10https://gerrit.wikimedia.org/r/966209 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:15:02] so both of these are from the same source. and in the past we have not silenced them. 
I will see if that's worth doing [14:15:14] there might be some value in just having these here too so I am a bit split [14:15:47] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:01] sukhe: you can downtime them with the downtime cookbook, knowing the hostname ofc [14:16:10] where hostname here is the IP itself ;) [14:16:32] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:16:33] volans: yeah, just figuring out if we should do that as part of our regular maint work or not [14:16:49] you know, you could have a cookbook that does that for you :-P [14:16:54] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:17:23] yeah, soon™, once we remove the last blocker for T347054 :) [14:17:24] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [14:17:45] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:17:48] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:18:35] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:18:44] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Update evaluators to latest ahead of WASM switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/966208 (owner: 10Jforrester) [14:19:45] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators to latest ahead of WASM switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/966208 (owner: 10Jforrester) [14:20:24] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:20:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:21:06] (03PS4) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) [14:21:27] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:21:36] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10BBlack) >>! In T348837#9253720, @cmooney wrote: > The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by... [14:22:17] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:22:22] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:22:29] (03CR) 10Herron: [V: 03+1 C: 03+2] prometheus: add redis_arclamp to redis jobs [puppet] - 10https://gerrit.wikimedia.org/r/966209 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:23:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:23:16] (03CR) 10Jelto: "looks mostly good to me to make httpbb tests work. Although I like the approach of keeping the CI servers as isolated as possible. But allo" [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [14:23:33] RECOVERY - Host 2620:0:863:1:198:35:26:7 is UP: PING OK - Packet loss = 0%, RTA = 72.19 ms [14:23:38] OK, all done.
[14:24:04] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [14:24:48] (03PS1) 10Ladsgroup: wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) [14:25:04] (03Merged) 10jenkins-bot: ml-services: add langid in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [14:25:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [14:26:47] (03PS2) 10Jforrester: wikifunctions: Rev charts to 0.2.0, move TODOs around for clarity [deployment-charts] - 10https://gerrit.wikimedia.org/r/965239 [14:26:54] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[14:27:05] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:28:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [14:29:04] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Rev charts to 0.2.0, move TODOs around for clarity [deployment-charts] - 10https://gerrit.wikimedia.org/r/965239 (owner: 10Jforrester) [14:29:59] (03Merged) 10jenkins-bot: wikifunctions: Rev charts to 0.2.0, move TODOs around for clarity [deployment-charts] - 10https://gerrit.wikimedia.org/r/965239 (owner: 10Jforrester) [14:30:31] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:30:34] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:32:27] (RedisMemoryFull) firing: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:33:08] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:33:37] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:33:48] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:34:32] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10dcaro) [14:34:41] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:34:53] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:35:47] !log jforrester@deploy2002 
helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:37:23] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) BTW This is also the approach recommended by Katran >>! In T348837#9253591, @ayounsi wrote: > So my preference here would go to option (2) limit the MSS on the relevant... [14:38:35] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:25] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:7 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:41:19] (03PS1) 10Herron: arclamp::redis: double maxmemory [puppet] - 10https://gerrit.wikimedia.org/r/966216 (https://phabricator.wikimedia.org/T348756) [14:41:23] RECOVERY - Recursive DNS on 198.35.26.7 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:42:17] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:42:57] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44073/console" [puppet] - 10https://gerrit.wikimedia.org/r/966216 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:43:35] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:55] (03CR) 10Herron: [V: 
03+1 C: 03+2] "self-merging -- attempting to resolve open (RedisMemoryFull) firing: Redis memory full on arclamp1001:9121 alert" [puppet] - 10https://gerrit.wikimedia.org/r/966216 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:46:32] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10BBlack) One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica... [14:48:18] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) for QUIC there are ongoing efforts like https://datatracker.ietf.org/doc/draft-pskim-passive-probing-pmtud/ [14:49:24] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:05] (03PS1) 10DLynch: Merge ReplyWidget[Plain/Visual] modules [extensions/DiscussionTools] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965904 (https://phabricator.wikimedia.org/T348834) [14:52:27] (RedisMemoryFull) resolved: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:53:49] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:53:49] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[14:54:20] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on sessionstore2001.codfw.wmnet with reason: Moving host — T348142 [14:54:26] T348142: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 [14:54:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:54:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sessionstore2001.codfw.wmnet with reason: Moving host — T348142 [14:55:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4003.wikimedia.org with OS bookworm [14:55:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:11] (03PS1) 10Ssingh: Revert "hiera: remove dns4003 for authdns_servers for reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/965905 [14:56:32] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10dcaro) Hmm, elasticsearch-curator also tries to build and install pyyaml 3 if it's not there, and failing for me on python3.11. Note that... [14:56:34] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) >>! In T348446#9252677, @ayounsi wrote: > Maybe prepending the AS on the backup LVS is easier to do than expected? > i though PyBal's developm... 
[14:57:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:58:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:58:55] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:30] (03CR) 10Jbond: [C: 03+1] "LGTM however FYI tox in bookworm is still only 3.28" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 (owner: 10Volans) [15:03:34] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:04:35] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: remove dns4003 for authdns_servers for reimaging" [puppet] - 10https://gerrit.wikimedia.org/r/965905 (owner: 10Ssingh) [15:08:13] !log running authdns-update [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:10:08] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 (owner: 10Volans) [15:10:27] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for sessionstore2001.codfw.wmnet [15:10:27] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore2001.codfw.wmnet [15:10:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:12:28] (03PS1) 10Fabfur: haproxy: start working on healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 [15:13:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade 
Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [15:13:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:16:10] (03CR) 10Jbond: [C: 03+1] tox.ini: remove optimization for tox <4 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 (owner: 10Volans) [15:17:44] (03CR) 10Jbond: [C: 03+1] "lgtm , the interface is quite nice and simple FYI" [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:18:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:18:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:18:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:18:55] (03CR) 10Jbond: [C: 03+1] "lgtm with the same comments as spicerack" [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans) [15:19:47] (03PS1) 10Marostegui: check_private_data: Add Arnaud [puppet] - 10https://gerrit.wikimedia.org/r/966222 [15:20:37] (03PS2) 10Marostegui: check_private_data: Add Arnaud [puppet] - 10https://gerrit.wikimedia.org/r/966222 [15:21:50] (03CR) 10Ssingh: [C: 03+1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [15:23:07] (03CR) 10Volans: [C: 03+2] tox.ini: remove optimization for tox <4 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 (owner: 10Volans) [15:27:59] (03CR) 10Jbond: [C: 03+1] dhcp: acquire exclusive per-DC lock on write (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1530). [15:30:43] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:10] 10SRE, 10ops-eqiad: 1 PSU down on both lsw1-e5-eqiad and lsw1-e7-eqiad - https://phabricator.wikimedia.org/T349002 (10cmooney) p:05Triage→03Medium [15:32:13] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:23] (03Merged) 10jenkins-bot: tox.ini: remove optimization for tox <4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966171 (owner: 10Volans) [15:32:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T343198)', diff saved to https://phabricator.wikimedia.org/P52971 and previous config saved to /var/cache/conftool/dbconfig/20231016-153247-arnaudb.json [15:32:53] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:34:11] (03CR) 10Herron: [V: 03+1 C: 03+2] alertmanager::api: enable POST logging [puppet] - 10https://gerrit.wikimedia.org/r/965785 (https://phabricator.wikimedia.org/T321579) (owner: 10Herron) [15:40:05] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10collaboration-services: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10LSobanski) a:03Arnoldokoth No reports for two years. L... 
[15:41:13] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) a:03dcaro [15:42:47] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) [15:43:00] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Papaul) 05Open→03Resolved This is complete. [15:47:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P52972 and previous config saved to /var/cache/conftool/dbconfig/20231016-154754-arnaudb.json [15:52:18] everything going good with that little timedmediahandler fix? :D [15:52:53] i love it when things are simple backports :D [16:00:04] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P52973 and previous config saved to /var/cache/conftool/dbconfig/20231016-160300-arnaudb.json [16:05:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:09] (03CR) 10Eevans: streams: update regex for 4.x `nodetool netstats` output (031 comment) [debs/cassandra-tools-wmf] - 
10https://gerrit.wikimedia.org/r/965788 (owner: 10Eevans) [16:05:15] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:55] (03PS1) 10Ebernhardson: cirrus-updater: Staging to read from prod, write to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/966246 [16:07:57] (03PS1) 10Ebernhardson: cirrus-updater: Add routes for event stream to use local proxy for schema access [deployment-charts] - 10https://gerrit.wikimedia.org/r/966247 (https://phabricator.wikimedia.org/T347075) [16:10:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [16:15:31] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) @Jclark-ctr do you need the logs as specified here (https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshoo... 
[16:16:52] (03CR) 10JMeybohm: [C: 03+2] base.helper: Allow to use ClusterIP services [deployment-charts] - 10https://gerrit.wikimedia.org/r/965717 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [16:16:55] (03CR) 10JMeybohm: [C: 03+2] Add new version for base.helper (1.1.1) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965716 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [16:17:21] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Staging to read from prod, write to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/966246 (owner: 10Ebernhardson) [16:17:24] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Add routes for event stream to use local proxy for schema access [deployment-charts] - 10https://gerrit.wikimedia.org/r/966247 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:17:44] (03Merged) 10jenkins-bot: Add new version for base.helper (1.1.1) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965716 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [16:18:03] (03Merged) 10jenkins-bot: base.helper: Allow to use ClusterIP services [deployment-charts] - 10https://gerrit.wikimedia.org/r/965717 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [16:18:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T343198)', diff saved to https://phabricator.wikimedia.org/P52974 and previous config saved to /var/cache/conftool/dbconfig/20231016-161806-arnaudb.json [16:18:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:18:21] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [16:18:23] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:18:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2174 
(T343198)', diff saved to https://phabricator.wikimedia.org/P52975 and previous config saved to /var/cache/conftool/dbconfig/20231016-161829-arnaudb.json [16:18:37] (03Merged) 10jenkins-bot: cirrus-updater: Staging to read from prod, write to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/966246 (owner: 10Ebernhardson) [16:18:39] (03Merged) 10jenkins-bot: cirrus-updater: Add routes for event stream to use local proxy for schema access [deployment-charts] - 10https://gerrit.wikimedia.org/r/966247 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:20:21] James_F: Can we attempt to merge 965889 again? CI passed this time. [16:20:37] thanks for the reviews [16:20:42] Of course. Done. [16:21:08] Thanks again. Let's see if it gets stuck again [16:23:49] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:24:17] (03PS2) 10JMeybohm: wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) [16:25:18] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:30:06] (03CR) 10Elukey: [C: 03+1] streams: update regex for 4.x `nodetool netstats` output [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/965788 (owner: 10Eevans) [16:30:43] (03CR) 10JMeybohm: wikifunctions: Use ClusterIP services for evaluators (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [16:31:28] (03PS1) 10Andrew Bogott: wmf_sink: correct calls to get_keystone_session [puppet] - 10https://gerrit.wikimedia.org/r/966253 [16:31:31] (03PS1) 10Elukey: ml-services: Update Docker image for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/966254 (https://phabricator.wikimedia.org/T340507) [16:32:38] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: correct calls to 
get_keystone_session [puppet] - 10https://gerrit.wikimedia.org/r/966253 (owner: 10Andrew Bogott) [16:35:16] (03PS1) 10Jelto: gitlab: add hardware 2fa issues to gitlab-replica banner [puppet] - 10https://gerrit.wikimedia.org/r/966255 (https://phabricator.wikimedia.org/T330639) [16:35:47] (03PS1) 10Stevemunene: airflow-wmde: Place airflow1007 in airflow-wmde role [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) [16:36:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10elukey) @RobH Hi! Lemme know if I can help in any way to move this forward, it is a non standard request I know, sorry and thanks for the patience! [16:36:48] (03PS1) 10Ebernhardson: cirrus-updater: Correct event-stream http route definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966257 [16:36:56] (03CR) 10CI reject: [V: 04-1] cirrus-updater: Correct event-stream http route definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966257 (owner: 10Ebernhardson) [16:37:03] ACKNOWLEDGEMENT - Host lsw1-f7-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected - loopback not pingable until we have 1 server connected. [16:37:15] ACKNOWLEDGEMENT - Host lsw1-f7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected - loopback not pingable until we have 1 server connected. [16:38:29] (03PS2) 10Ebernhardson: cirrus-updater: Correct event-stream http route definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966257 [16:38:45] ACKNOWLEDGEMENT - Host lsw1-f6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected - loopback not pingable until we have 1 server connected. 
[16:39:24] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Correct event-stream http route definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966257 (owner: 10Ebernhardson) [16:40:23] (03Merged) 10jenkins-bot: cirrus-updater: Correct event-stream http route definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966257 (owner: 10Ebernhardson) [16:42:58] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:43:13] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:45:02] (03CR) 10Elukey: [C: 03+2] ml-services: Update Docker image for langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/966254 (https://phabricator.wikimedia.org/T340507) (owner: 10Elukey) [16:46:06] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [16:48:16] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[16:50:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:50:52] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: add hardware 2fa issues to gitlab-replica banner [puppet] - 10https://gerrit.wikimedia.org/r/966255 (https://phabricator.wikimedia.org/T330639) (owner: 10Jelto) [16:51:52] (03CR) 10LSobanski: [C: 03+1] gitlab: add hardware 2fa issues to gitlab-replica banner [puppet] - 10https://gerrit.wikimedia.org/r/966255 (https://phabricator.wikimedia.org/T330639) (owner: 10Jelto) [16:52:34] (03CR) 10Eevans: [V: 03+2 C: 03+2] streams: update regex for 4.x `nodetool netstats` output [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/965788 (owner: 10Eevans) [16:55:45] (03CR) 10JMeybohm: [C: 03+1] Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [16:55:53] ACKNOWLEDGEMENT - Host lsw1-f6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:56:05] ACKNOWLEDGEMENT - Host lsw1-f5-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:56:13] (03CR) 10JMeybohm: [C: 03+1] flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130 (owner: 10Ebernhardson) [16:56:15] ACKNOWLEDGEMENT - Host lsw1-f5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. 
T348977 [16:56:25] ACKNOWLEDGEMENT - Host lsw1-e7-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:56:35] ACKNOWLEDGEMENT - Host lsw1-e7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:56:45] ACKNOWLEDGEMENT - Host lsw1-e6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:57:00] ACKNOWLEDGEMENT - Host lsw1-e6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:57:10] ACKNOWLEDGEMENT - Host lsw1-e5-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:57:25] ACKNOWLEDGEMENT - Host lsw1-e5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Expected, loopback int will be down until we connect servers. T348977 [16:57:53] (03CR) 10JMeybohm: [C: 03+1] trafficserver: move 15% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964447 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [16:57:58] (03CR) 10JMeybohm: [C: 03+1] trafficserver: move 20% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964448 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [16:58:02] (03CR) 10JMeybohm: [C: 03+1] trafficserver: move 25% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1700) [17:00:06] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T1700). [17:01:23] (03CR) 10JMeybohm: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [17:02:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:02:25] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:29] (03CR) 10Ryan Kemper: [C: 03+1] airflow-wmde: Place airflow1007 in airflow-wmde role [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:06:25] (03CR) 10Ryan Kemper: [C: 03+1] airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:06:47] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:08:44] (03CR) 10Volans: [C: 03+2] cookbooks: acquire lock for each cookbook run [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:10:52] (03CR) 10Volans: [C: 03+2] spicerack: add _spicerack_lock private accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:11:08] (03PS11) 10Ryan Kemper: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) 
[17:11:13] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:11:20] (03CR) 10Volans: [C: 03+2] dhcp: always rewrite the DHCP snippet [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 (owner: 10Volans) [17:11:36] (03CR) 10Volans: [C: 03+2] dhcp: simplify tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/963757 (owner: 10Volans) [17:13:00] (03PS2) 10Ryan Kemper: airflow-wmde: Place airflow1007 in airflow-wmde role [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:13:09] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:15:33] (03Merged) 10jenkins-bot: cookbooks: acquire lock for each cookbook run [software/spicerack] - 10https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:16:25] (03CR) 10Ryan Kemper: [C: 03+1] airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:16:34] (03CR) 10Alex Paskulin: [C: 03+1] "Looks great! Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/966148 (https://phabricator.wikimedia.org/T343268) (owner: 10Hnowlan) [17:17:44] (03Merged) 10jenkins-bot: spicerack: add _spicerack_lock private accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/966172 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:18:25] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:18:58] (03Merged) 10jenkins-bot: dhcp: always rewrite the DHCP snippet [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 (owner: 10Volans) [17:19:10] (03Merged) 10jenkins-bot: dhcp: simplify tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/963757 (owner: 10Volans) [17:19:55] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:23:46] (03PS3) 10Volans: dhcp: acquire exclusive per-DC lock on write [software/spicerack] - 10https://gerrit.wikimedia.org/r/966173 (https://phabricator.wikimedia.org/T341973) [17:28:47] (03CR) 10Stevemunene: [C: 03+2] airflow-wmde: Place airflow1007 in airflow-wmde role [puppet] - 10https://gerrit.wikimedia.org/r/966256 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:29:27] (03CR) 10Btullis: "Looks good to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:29:32] (03CR) 10Btullis: [C: 03+1] airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:41:33] !log Upgrading navtiming on the webperf hosts in the beta cluster [17:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:28] (03PS12) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:42:58] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:45:26] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:46:05] (03PS1) 10Ebernhardson: cirrus updater: Adjust event stream http route to match updated config [deployment-charts] - 10https://gerrit.wikimedia.org/r/966268 (https://phabricator.wikimedia.org/T347075) [17:48:05] (03CR) 10DCausse: [C: 03+1] cirrus updater: Adjust event stream http route to match updated config [deployment-charts] - 10https://gerrit.wikimedia.org/r/966268 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [17:52:28] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Adjust event stream http route to match updated config [deployment-charts] - 10https://gerrit.wikimedia.org/r/966268 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [17:53:16] (03Merged) 10jenkins-bot: cirrus updater: Adjust event stream http route to match updated config [deployment-charts] - 10https://gerrit.wikimedia.org/r/966268 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) 
[17:55:34] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:55:47] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:58:33] 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10Papaul) @nskaggs hello true that codfw will me moving to the EVPN/VXLAN design but codfw doesn't have that many racks to dedicate 2 r... [17:59:04] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:12] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [17:59:26] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [18:04:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:07:11] (03PS1) 10Ebernhardson: cirrus updater: Correct http route path suffixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/966275 [18:11:14] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Correct http route path suffixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/966275 (owner: 10Ebernhardson) [18:11:55] (03Merged) 10jenkins-bot: cirrus updater: Correct http route path suffixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/966275 (owner: 10Ebernhardson) [18:19:00] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:20:51] !log 
ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:27:06] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [18:27:12] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [18:27:16] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [18:32:26] (03PS41) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [18:32:28] (03PS1) 10AOkoth: vrts: add new required packages v6.3.4 [puppet] - 10https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) [18:32:51] (03PS2) 10AOkoth: vrts: add new required packages v6.3.4 [puppet] - 10https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) [18:40:00] (03CR) 10Dr0ptp4kt: "Looping Dan in case of any considerations - I see this table in https://phabricator.wikimedia.org/source/operations-puppet/browse/producti" [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [18:41:06] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:41:33] (03CR) 10Ladsgroup: wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [18:41:59] (03PS3) 10AOkoth: vrts: add new required packages v6.3.4 [puppet] - 10https://gerrit.wikimedia.org/r/966279 (https://phabricator.wikimedia.org/T348987) [18:42:11] (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4422 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [18:42:17] (NodeTextfileStale) 
firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:42:28] !incidents [18:42:29] 4131 (UNACKED) MXQueueHigh misc sre (mx1001:9100 node eqiad) [18:42:32] !ack 4131 [18:42:33] 4131 (ACKED) MXQueueHigh misc sre (mx1001:9100 node eqiad) [18:42:56] ^ looking [18:43:35] (03PS5) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [18:43:36] here [18:43:37] (03PS1) 10Jdlrobson: Fixes Thai Wikinews wordmark and sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966280 (https://phabricator.wikimedia.org/T348757) [18:43:39] (03PS1) 10Jdlrobson: Update logos for remaining Wikisource projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966281 (https://phabricator.wikimedia.org/T343753) [18:44:12] denisse: there is a command to check the queue in the docs [18:44:33] sukhe: Yes, looking at this one: https://wikitech.wikimedia.org/wiki/Exim [18:46:37] I have seen this email address before [18:48:34] I am going to exim -q it [18:50:50] in fact, no, I will do it just for that specific user [18:51:15] !log exiqgrep -i -r | xargs exim -Mrm [18:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:11] (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4276 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [18:52:31] !ack 4132 [18:52:32] Attempt to ack incident 4132 failed. 
[18:52:37] !incidents [18:52:37] 4131 (RESOLVED) MXQueueHigh misc sre (mx1001:9100 node eqiad) [18:52:47] ok :) [18:52:59] :D [18:53:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:57:35] thanks denisse for ACKing so quickly! [18:58:29] Anytime! :D Thanks to you for always helping out. ^^ [19:02:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:04:27] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10jhathaway) >>! In T348837#9253425, @cmooney wrote: > Either way, katran/liberica uses IPIP, so having the option for GUE in IPVS doesn't solve that problem if we hit it. I think w... 
[19:08:45] (03PS1) 10Ebernhardson: cirrus updater: topic-prefix-filter must be plain string prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/966283 (https://phabricator.wikimedia.org/T347075) [19:09:40] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts aqs1010.eqiad.wmnet [19:09:58] (03PS2) 10Ebernhardson: cirrus updater: topic-prefix-filter must be plain string prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/966283 (https://phabricator.wikimedia.org/T347075) [19:10:49] (03PS6) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [19:10:51] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: topic-prefix-filter must be plain string prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/966283 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:11:35] (03Merged) 10jenkins-bot: cirrus updater: topic-prefix-filter must be plain string prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/966283 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:12:47] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:13:08] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:15:54] (03PS1) 10Ebernhardson: cirrus updater: pipeline.name is a required parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/966285 (https://phabricator.wikimedia.org/T347075) [19:16:34] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10jhathaway) >>! In T348837#9254127, @BBlack wrote: > One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long... 
[19:17:27] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts aqs1010.eqiad.wmnet [19:17:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:18:21] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: pipeline.name is a required parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/966285 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:19:23] (03Merged) 10jenkins-bot: cirrus updater: pipeline.name is a required parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/966285 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [19:20:40] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:20:53] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:23:36] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts aqs1010.eqiad.wmnet [19:23:55] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts aqs1010.eqiad.wmnet [19:25:35] (03PS1) 10Ebernhardson: cirrus updater: Correct placement of pipeline.name parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/966287 [19:27:50] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:27:53] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:28:05] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Correct placement of pipeline.name parameter [deployment-charts] - 
10https://gerrit.wikimedia.org/r/966287 (owner: 10Ebernhardson) [19:28:47] (03Merged) 10jenkins-bot: cirrus updater: Correct placement of pipeline.name parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/966287 (owner: 10Ebernhardson) [19:30:02] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:30:17] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:38:45] (03PS1) 10Ebernhardson: cirrus updater: Correct kafka bootstrap servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/966288 [19:39:48] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Correct kafka bootstrap servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/966288 (owner: 10Ebernhardson) [19:40:45] (03Merged) 10jenkins-bot: cirrus updater: Correct kafka bootstrap servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/966288 (owner: 10Ebernhardson) [19:42:24] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:42:36] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:48:35] (03PS1) 10Ebernhardson: cirrus updater: Correct invalid stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/966290 [19:53:08] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Correct invalid stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/966290 (owner: 10Ebernhardson) [19:53:13] (03CR) 10Dr0ptp4kt: wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [19:53:57] (03Merged) 10jenkins-bot: cirrus updater: Correct invalid stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/966290 (owner: 10Ebernhardson) [19:55:20] !log ebernhardson@deploy2002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [19:55:31] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:57:14] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [19:58:48] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T2000) [20:00:05] Dreamy_Jazz, kemayo, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] \o [20:00:54] 👋🏻 [20:01:00] RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:00] o/ [20:07:16] Any deployers around? 
[20:07:31] hi - sorry i'm late - i can deploy [20:07:35] :D [20:07:36] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:07:47] (03PS6) 10Clare Ming: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:08:07] Dreamy_Jazz: i'll start with yours [20:08:23] Thanks. [20:08:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:09:09] (03CR) 10Clare Ming: [C: 03+2] Merge ReplyWidget[Plain/Visual] modules [extensions/DiscussionTools] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965904 (https://phabricator.wikimedia.org/T348834) (owner: 10DLynch) [20:09:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [20:10:02] (03Merged) 10jenkins-bot: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [20:10:16] !log cjming@deploy2002 Started scap: Backport for [[gerrit:964545|Enable display of Client Hints data on all wikis (T341110 T337942)]] [20:10:28] T341110: Deploy client hints functionality - 
https://phabricator.wikimedia.org/T341110 [20:10:28] T337942: Display client hint data - https://phabricator.wikimedia.org/T337942 [20:11:29] !log cjming@deploy2002 dreamyjazz and cjming: Backport for [[gerrit:964545|Enable display of Client Hints data on all wikis (T341110 T337942)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:33] Dreamy_Jazz: can you test? [20:11:37] Yup [20:13:12] Test successful. [20:13:16] cool - syncing [20:13:19] !log cjming@deploy2002 dreamyjazz and cjming: Continuing with sync [20:13:58] (03PS7) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [20:14:00] (03Merged) 10jenkins-bot: Merge ReplyWidget[Plain/Visual] modules [extensions/DiscussionTools] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965904 (https://phabricator.wikimedia.org/T348834) (owner: 10DLynch) [20:18:33] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:964545|Enable display of Client Hints data on all wikis (T341110 T337942)]] (duration: 08m 17s) [20:18:36] Dreamy_Jazz: should be live [20:18:39] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [20:18:39] T337942: Display client hint data - https://phabricator.wikimedia.org/T337942 [20:18:41] Kemayo: onto yours [20:18:52] Thanks! [20:18:57] 🎉 [20:19:13] !log cjming@deploy2002 Started scap: Backport for [[gerrit:965904|Merge ReplyWidget[Plain/Visual] modules (T348834)]] [20:19:17] T348834: Reply and new topic tools fail to load - https://phabricator.wikimedia.org/T348834 [20:20:32] !log cjming@deploy2002 kemayo and cjming: Backport for [[gerrit:965904|Merge ReplyWidget[Plain/Visual] modules (T348834)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:36] Kemayo: able to test? 
[20:20:45] cjming: Testing now [20:21:09] cjming: Works fine [20:21:15] nice - going live [20:21:18] !log cjming@deploy2002 kemayo and cjming: Continuing with sync [20:23:10] (03PS8) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [20:23:23] hi Jdlrobson: i'll start with the Thai wikinews patch? [20:25:27] (03PS9) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [20:25:29] cjming: yes please! [20:26:37] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:965904|Merge ReplyWidget[Plain/Visual] modules (T348834)]] (duration: 07m 23s) [20:26:42] T348834: Reply and new topic tools fail to load - https://phabricator.wikimedia.org/T348834 [20:26:50] Kemayo: should be live [20:27:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966280 (https://phabricator.wikimedia.org/T348757) (owner: 10Jdlrobson) [20:27:11] cjming: Thanks! [20:28:02] (03Merged) 10jenkins-bot: Fixes Thai Wikinews wordmark and sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966280 (https://phabricator.wikimedia.org/T348757) (owner: 10Jdlrobson) [20:28:15] !log cjming@deploy2002 Started scap: Backport for [[gerrit:966280|Fixes Thai Wikinews wordmark and sewikimedia (T348757 T347534)]] [20:28:21] T347534: Create and deploy tagline for sewikimedia - https://phabricator.wikimedia.org/T347534 [20:28:22] T348757: Provide a wordmark for Thai Wikinews - https://phabricator.wikimedia.org/T348757 [20:29:30] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:966280|Fixes Thai Wikinews wordmark and sewikimedia (T348757 T347534)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:35] Jdlrobson: mind testing? 
[20:29:36] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1049 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:29:39] on it [20:29:59] cjming: LGTM please sync [20:30:08] will do [20:30:11] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [20:30:46] (03PS10) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [20:31:29] Jdlrobson: each of your other patches needs to be rebased on master, right? [20:31:57] relation chains confuse me sometimes [20:33:03] cjming: yep [20:33:16] (03PS2) 10Jdlrobson: Update logos for remaining Wikisource projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966281 (https://phabricator.wikimedia.org/T343753) [20:33:19] next up: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/966281 [20:35:24] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:966280|Fixes Thai Wikinews wordmark and sewikimedia (T348757 T347534)]] (duration: 07m 08s) [20:35:29] T347534: Create and deploy tagline for sewikimedia - https://phabricator.wikimedia.org/T347534 [20:35:30] T348757: Provide a wordmark for Thai Wikinews - https://phabricator.wikimedia.org/T348757 [20:35:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966281 (https://phabricator.wikimedia.org/T343753) (owner: 10Jdlrobson) [20:36:26] (03Merged) 10jenkins-bot: Update logos for remaining Wikisource projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966281 (https://phabricator.wikimedia.org/T343753) (owner: 10Jdlrobson) [20:36:41] !log cjming@deploy2002 Started scap: Backport for [[gerrit:966281|Update logos for remaining Wikisource projects (T343753)]] [20:36:45] T343753: Update logos for remaining Wikisource projects - 
https://phabricator.wikimedia.org/T343753 [20:37:56] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:966281|Update logos for remaining Wikisource projects (T343753)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:19] Jdlrobson: can you test? [20:38:41] cjming: on it [20:39:13] cjming: LGTM! Please sync! [20:39:20] ok! [20:39:24] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [20:40:39] (03PS11) 10Clare Ming: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) (owner: 10Jdlrobson) [20:44:32] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:966281|Update logos for remaining Wikisource projects (T343753)]] (duration: 07m 50s) [20:44:36] T343753: Update logos for remaining Wikisource projects - https://phabricator.wikimedia.org/T343753 [20:44:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) (owner: 10Jdlrobson) [20:45:28] (03Merged) 10jenkins-bot: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) (owner: 10Jdlrobson) [20:45:42] !log cjming@deploy2002 Started scap: Backport for [[gerrit:960147|wordmarks/taglines for Wiktionary projects (T341257)]] [20:45:47] T341257: Design: Provide wordmarks/taglines for Wiktionary projects - https://phabricator.wikimedia.org/T341257 [20:46:55] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:960147|wordmarks/taglines for Wiktionary projects (T341257)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:59] Jdlrobson: last patch ready to test [20:47:09] cjming: on it! [20:47:36] cjming: this also lgtm! 
[20:47:45] great - syncing [20:47:47] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [20:50:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:52:59] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:960147|wordmarks/taglines for Wiktionary projects (T341257)]] (duration: 07m 17s) [20:53:07] T341257: Design: Provide wordmarks/taglines for Wiktionary projects - https://phabricator.wikimedia.org/T341257 [20:53:29] Jdlrobson: should be all live - i went ahead and purged the few files that needed it [20:53:34] Hey all - we’ve got a handful of sec patches we’d like to get out during the deployment window today. Anything else still in progress right now? [20:53:51] sbassett: just finished the last patch - all yours [20:54:38] thanks cjming ! [20:54:47] hopefully that's the bulk of all the logo changes now for some time :) [20:54:55] tx, cjming [20:55:14] Jdlrobson: sounds good :) [20:55:19] !log end of UTC late backport window [20:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:59:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231016T2100). 
[21:04:44] !log deployed security mitigation for T348828 [21:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:55] (03CR) 10Ladsgroup: wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966213 (https://phabricator.wikimedia.org/T299947) (owner: 10Ladsgroup) [21:26:01] (03PS1) 10Eevans: echostore: update Kask image to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966301 (https://phabricator.wikimedia.org/T348647) [21:26:18] (03PS1) 10Eevans: sessionstore: update Kask image to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966302 (https://phabricator.wikimedia.org/T348647) [21:39:28] (03CR) 10Subramanya Sastry: Use Parsoid for all Wikis for Content Translation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930744 (https://phabricator.wikimedia.org/T339322) (owner: 10KartikMistry) [21:40:27] !log deployed security patch for T348343 [21:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:59] (03PS1) 10Bking: wdqs.data-reload: Add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/966303 (https://phabricator.wikimedia.org/T349011) [21:53:50] !log deployed security patch for T347708 [21:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:32] !log deployed security patch for T347742 [22:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:48] (03PS1) 10Ebernhardson: kafka-main: Allow connections from wikikube-staging [puppet] - 10https://gerrit.wikimedia.org/r/966308 (https://phabricator.wikimedia.org/T347075) [22:36:01] (03CR) 10Ebernhardson: [V: 03+1] 
"PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44074/console" [puppet] - 10https://gerrit.wikimedia.org/r/966308 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [22:38:15] (03PS2) 10Tim Starling: Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) [22:42:17] (NodeTextfileStale) firing: (2) Stale textfile for puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:45:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:37:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:25] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10lmata) [23:46:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state