[00:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/938343 [00:38:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/938343 (owner: 10TrainBranchBot) [00:54:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/938343 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342071 (10phaultfinder) [01:50:39] (03CR) 10Kaleem Bhatti: "anyone know why error showing what's problem how I can solve this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [01:54:13] (03CR) 10Tim Starling: "I think you should set $wgAutoCreateTempUser['serialProvider'] = 'centralauth';" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 
10Urbanecm) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0200) [02:00:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:01:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:03:53] (03CR) 10Tim Starling: [C: 04-1] IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [02:07:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.18 [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246) [02:07:10] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.18 [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [02:08:23] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:21] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.18 [core] 
(wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [02:29:22] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0300) [03:00:35] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:07] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:04:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:06:31] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:01] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:57] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:25] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:51] (03CR) 10Hashar: [C: 03+1] "There was an issue with some of the grants not having any effect and the issue was that multiple were used which does not wor" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:37:57] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:42:27] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:37] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0600) [06:00:04] kormat, marostegui, and Amir1: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0600). [06:00:27] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) (owner: 10Cathal Mooney) [06:00:33] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:05:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:08:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:12:54] (03CR) 10Elukey: [C: 03+1] ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) 
[06:13:14] (03CR) 10Elukey: eventgate: set a more performant default for queue.buffering.max.ms (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [06:14:05] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:14:13] (03CR) 10Ayounsi: [C: 03+2] Add Python 3.11 support [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [06:16:08] (03Merged) 10jenkins-bot: Add Python 3.11 support [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [06:17:01] (03PS3) 10Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) [06:18:35] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:19:31] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [06:21:08] (03Merged) 10jenkins-bot: ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [06:23:07] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:29:22] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:45] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:26] 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10ayounsi) p:05Triage→03Low [06:34:59] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) [06:35:07] 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10ayounsi) [06:36:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cr3-knams,cr3-knams IPv6 with reason: Downtime cr3-knams ahead of remote hands moving router [06:36:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cr3-knams,cr3-knams IPv6 with reason: Downtime cr3-knams ahead of remote hands moving router [06:48:39] !log disable asw-b-codfw:ae0 (to cloudsw1-b1-codfw) - T342076 [06:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:43] T342076: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 [06:53:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0700) [07:00:05] James_F: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:08:20] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10ayounsi) [07:16:46] !log restart kafka main-codfw rebalances (long maintenance) - T341558 [07:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:50] T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [07:20:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:32] (03CR) 10Filippo Giunchedi: [C: 03+2] New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [07:24:41] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [07:24:45] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:33:54] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 
(10ayounsi) a:03ayounsi [07:35:35] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [07:35:57] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove grafana log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [07:36:09] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove k8s stats-exporter cloning [puppet] - 10https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [07:36:17] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove pybal log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [07:38:22] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:59] (03PS1) 10JMeybohm: deployment_server: Add helmfile value wmf_staging_environment [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) [07:42:56] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10cmooney) 05Open→03Resolved I’m going to close this task for now. The problem has been mitigated as best as possible with the current equipment we have. In time replacing... 
[07:46:21] (03CR) 10Vgutierrez: [C: 03+1] "fix the commit message and LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/938902 (owner: 10Fabfur) [07:47:45] (03PS2) 10Fabfur: hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983) [07:49:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:52:31] (03PS1) 10Filippo Giunchedi: prometheus: fix duplicate declaration for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/939232 [07:52:34] (03CR) 10Ayounsi: [C: 03+1] CR: cloud-host: allow return traffic for PDNS servers [homer/public] - 10https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966) (owner: 10Arturo Borrero Gonzalez) [07:53:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix duplicate declaration for statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/939232 (owner: 10Filippo Giunchedi) [07:55:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [07:59:01] this is me --^ [07:59:56] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [08:00:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [08:08:31] 10SRE-tools, 10Spicerack: Spicerack: don't write logs to disk - https://phabricator.wikimedia.org/T342079 (10ayounsi) [08:08:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [08:08:50] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [08:09:17] !log cr3-knams going offline for move [08:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:12] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 [08:11:48] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:48] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:12:04] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:12:06] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:54] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:22] !log disable puppet on A:cp-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938902/ (T340983) [08:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:25] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [08:14:25] (03CR) 10Fabfur: hiera: apply 
silent-drop on port 80 to drmrs cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:14:42] (03CR) 10Fabfur: [C: 03+2] hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:16:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [08:17:47] !log enable puppet on A:cp-drmrs for https://gerrit.wikimedia.org/r/c/operations/puppet/+/938902/ (T340983) (hosts will run puppet with the usual schedule) [08:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [08:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:22:13] (03PS2) 10Alexandros Kosiaris: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [08:23:08] (03PS3) 10Alexandros Kosiaris: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [08:24:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [08:24:47] (03CR) 10Klausman: [C: 03+1] 
ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:24:54] (03CR) 10Klausman: [C: 03+1] ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 (owner: 10Ilias Sarantopoulos) [08:25:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [08:25:54] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [08:27:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [08:27:30] (03CR) 10DCausse: [C: 03+1] Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [08:28:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [08:29:50] (03CR) 10Ayounsi: Manage TLS on network devices (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [08:30:33] (03CR) 10Ayounsi: [C: 03+2] Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [08:31:09] (03PS27) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [08:32:00] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:23] (03Merged) 10jenkins-bot: Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [08:33:53] (03CR) 10Arturo Borrero Gonzalez: [C: 
03+2] nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [08:34:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [08:34:45] (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) [08:36:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [08:36:21] 10SRE-tools, 10Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (10ayounsi) [08:37:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [08:37:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [08:39:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 9 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42512/console" [puppet] - 10https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:40:00] (03PS1) 10Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) [08:40:12] (03CR) 10CI reject: [V: 04-1] base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:40:21] (03PS2) 10Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) [08:40:59] (03PS5) 10Ilias Sarantopoulos: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [08:41:56] 
(03CR) 10Ilias Sarantopoulos: "It was just a styling error, it is now resolved!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [08:42:11] (03PS3) 10Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) [08:45:56] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [08:46:20] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:46:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:36] (03CR) 10Vgutierrez: [C: 03+1] hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:46:42] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:47:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [08:48:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [08:48:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [08:48:22] jouncebot: nowandnext [08:48:22] No deployments scheduled for the next 1 hour(s) and 11 minute(s) [08:48:22] In 1 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1000) [08:48:28] cool [08:48:45] (03CR) 10Ladsgroup: [C: 03+2] ores: use 
envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:49:28] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:49:35] (03Merged) 10jenkins-bot: ores: use envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [08:50:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:50:30] (03PS11) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [08:53:15] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]] [08:53:18] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [08:54:58] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10hashar) > I tested with this commit https://gerrit.wikimedia.org/r/c/operations/dns/+/936686 -- it all worked perfectly. Looks like the Gerrit acces... 
[08:55:28] !log disable puppet on A:cp-eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939235 (T340983) [08:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:34] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [08:56:09] (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [08:56:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [08:56:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [08:57:33] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:58:37] !log enable puppet on A:cp-eqiad for https://gerrit.wikimedia.org/r/939235 (T340983) (hosts will run puppet with the usual schedule) [08:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:34] (03CR) 10Jbond: vrts: drop bashisms and fix other CI issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:01:39] (03PS2) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) [09:02:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet [09:02:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [09:02:31] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 
10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) >>! In T320390#9018611, @Jelto wrote: > ... > There are two settings which we may test, one is `send_scope_to_token_endpoin... [09:02:39] (03CR) 10Jbond: [C: 03+2] vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:02:49] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [09:03:09] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 (owner: 10Ilias Sarantopoulos) [09:03:11] (03CR) 10Filippo Giunchedi: [C: 03+2] base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [09:03:39] (03Merged) 10jenkins-bot: ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [09:04:06] (03Merged) 10jenkins-bot: ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 (owner: 10Ilias Sarantopoulos) [09:04:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:44] (03PS2) 10Jbond: kerberos: fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) [09:04:46] (03PS2) 10Jbond: kerberos: Fix bashisms 
[puppet] - 10https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) [09:07:55] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:08:11] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]] (duration: 14m 56s) [09:08:12] (03CR) 10Jbond: [C: 03+2] kerberos: fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:08:14] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [09:08:18] (03CR) 10Jbond: [C: 03+2] kerberos: Fix bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:09:28] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [09:09:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:10:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [09:11:58] 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10DAlangi_WMF) [09:14:54] (03PS12) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [09:15:19] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[09:16:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [09:16:18] (03CR) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run (035 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [09:16:29] (03PS7) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [09:16:30] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:16:31] (03PS7) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) [09:16:33] (03PS7) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [09:16:46] (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) [09:17:03] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[09:17:26] (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:17:48] (03CR) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:18:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet [09:20:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:20:40] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:21:32] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:21:45] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[09:24:10] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42513/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:24:32] !log remove asw-b1-codfw from asw-b-codfw VC - T342076 [09:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] T342076: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 [09:24:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [09:24:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [09:26:04] RECOVERY - Check systemd state on dumpsdata1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:33] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:28:09] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:28:25] (03CR) 10ArielGlenn: "New ppc output: https://puppet-compiler.wmflabs.org/output/938816/42511/" [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:28:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1001.eqiad.wmnet [09:28:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[09:28:55] (03CR) 10Kaleem Bhatti: [C: 03+1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [09:29:13] (03PS6) 10Kaleem Bhatti: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) [09:29:53] (03PS8) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [09:29:54] PROBLEM - Check systemd state on ms-fe2013 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:54] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:30:04] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:30:11] the cadvisor unit failures are me [09:30:13] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:30:17] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42514/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:30:20] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:30:27] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:30:37] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:30:45] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:33:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:50] (03CR) 10Jbond: "thanks see responses inline" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:34:12] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [09:34:16] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42515/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:34:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:35:27] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) I looked at the GitLab `gitlabhq_production` database and `identities` table. I connected to the psql database using: `sud... [09:35:50] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10Joe) >>! 
In T341717#9010061, @jbond wrote: > I wonder if we should instead move config-master to a VM. AFAIK the... [09:37:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [09:38:17] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42516/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:39:16] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:40:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:41:25] (03CR) 10ArielGlenn: [C: 03+2] make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:43:40] 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10ayounsi) a:05ayounsi→03None [09:43:51] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:45:15] (03CR) 10Fabfur: [V: 03+1] hiera: apply silent-drop on port 80 to all cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:45:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:46:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10Joe) I think it makes sense to keep the conftool master on the same server as the puppetmaster, as we share reposito... [09:46:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] CR: cloud-host: allow return traffic for PDNS servers [homer/public] - 10https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966) (owner: 10Arturo Borrero Gonzalez) [09:47:59] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [09:48:10] A gate-and-submit job of wmf-quibble-vendor-mysql-php81-docker failed with some rsync permission errors on composer cache files: [09:48:11] rsync: [generator] failed to set times on "/cache/.": Operation not permitted (1) [09:48:14] (03PS2) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [09:48:17] rsync: [generator] recv_generator: mkdir "/cache/composer" failed: Permission denied (13) [09:48:23] rsync: [receiver] mkstemp "/cache/.phpcs.02e459ed8923.a8e9cc1cfb23.cache.JKuQZQ" failed: Permission denied (13) [09:48:31] etc [09:48:32] https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php81-docker/6481/console [09:48:49] is this likely a one-time thing and just flakiness or is there more going on? 
[09:50:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:51:01] (03PS2) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) [09:51:24] !log deploying https://gerrit.wikimedia.org/r/c/operations/homer/public/+/938819 via homer to cr-eqiad & cr-codfw [09:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:47] (03PS2) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) [09:51:49] (03PS2) 10Jbond: install_server: updaate to use bash [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [09:51:51] (03PS2) 10Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) [09:51:56] MichaelG_WMDE: I saw the same message a few times yesterday T341998 [09:51:56] T341998: Random CI builds failing with rsync errors - https://phabricator.wikimedia.org/T341998 [09:52:06] (03CR) 10Jbond: install_server: updaate to use bash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:52:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [09:52:25] (also, I think #wikimedia-releng is the better channel for that?) 
[09:52:37] (03PS2) 10Btullis: Upgrade the search instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286) [09:52:39] !log disable puppet on A:cp-esams to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939242 (T340983) [09:52:39] (03PS2) 10Btullis: Upgrade the research instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286) [09:52:41] (03PS2) 10Btullis: Update the platform_eng airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286) [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:42] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [09:52:43] (03PS2) 10Btullis: Upgrade the analytics_product airflow instance to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933091 (https://phabricator.wikimedia.org/T336286) [09:53:23] (JobUnavailable) resolved: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:42] RECOVERY - Check systemd state on ms-fe2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:12] (03PS3) 10Btullis: Upgrade the research instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286) [09:54:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42517/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) 
[09:54:25] (03PS3) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) [09:54:27] (03PS3) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [09:54:29] (03PS3) 10Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) [09:54:35] (03PS3) 10Btullis: Update the platform_eng airflow instance to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286) [09:55:05] (03PS3) 10Btullis: Upgrade the search instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286) [09:57:27] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10fgiunchedi) [09:57:58] (03CR) 10Ayounsi: [C: 03+2] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [09:58:54] (03PS3) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [09:58:58] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [09:59:51] (03Merged) 10jenkins-bot: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1000) [10:00:33] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1071.eqiad.wmnet with OS bullseye [10:01:33] (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: apply silent-drop on port 80 to all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:02:11] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [10:02:17] (03CR) 10Ayounsi: [C: 03+2] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [10:02:25] !log enable puppet on A:cp-esams for https://gerrit.wikimedia.org/r/939235 (T340983) (hosts will run puppet with the usual schedule) [10:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:29] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [10:02:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet [10:02:58] !log fix last entry: correct CR is https://gerrit.wikimedia.org/r/939242 [10:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:49] (03Merged) 10jenkins-bot: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [10:05:43] (03PS1) 10ArielGlenn: Fix lookup ofsettig to enable sytmd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) [10:05:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [10:06:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:07:49] (03PS1) 10JMeybohm: envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033) [10:08:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:09:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet [10:10:40] (03PS1) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) [10:10:52] (03PS2) 10ArielGlenn: Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) [10:10:55] (03PS1) 10Jbond: kerberos: fix carriage return [puppet] - 10https://gerrit.wikimedia.org/r/939250 (https://phabricator.wikimedia.org/T95064) [10:11:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [10:11:46] (03CR) 10JMeybohm: "Depends on I8fd8c34091c5c0eca18f3ddbe094f87f8c248722" [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:13:34] 10ops-knams: Inbound interface errors - https://phabricator.wikimedia.org/T342097 (10phaultfinder) [10:15:50] (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/939247/42519/ for ppc. I swear I get this lookup search path stuff wrong every single time." [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [10:16:03] 10SRE, 10Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Fabfur) [10:16:06] 10SRE, 10Traffic: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 (10Fabfur) 05Open→03Resolved a:03Fabfur The HAProxy configuration on all DCs has been updated to apply silent-drop to abusive clients hitting port 80, as been already done for port 443.... 
[10:17:15] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1071.eqiad.wmnet with reason: host reimage [10:17:23] (03CR) 10Btullis: [C: 03+1] Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [10:19:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:19:55] (03CR) 10ArielGlenn: [C: 03+2] Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [10:20:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1071.eqiad.wmnet with reason: host reimage [10:20:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:27] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#9022500, @Jelto wrote: >>>! In T320390#9018611, @Jelto wrote: >> ... >> There are two settings which we may... 
[10:21:48] (03CR) 10Jbond: [C: 03+2] kerberos: fix carriage return [puppet] - 10https://gerrit.wikimedia.org/r/939250 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [10:24:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr3-knams,cr3-knams IPv6 [10:24:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr3-knams,cr3-knams IPv6 [10:24:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet [10:24:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet [10:27:22] (03PS1) 10Cathal Mooney: repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997) [10:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet [10:35:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [10:35:45] (03CR) 10Jbond: "lgtm see comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [10:36:46] (03CR) 10Ayounsi: [C: 03+1] repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997) (owner: 10Cathal Mooney) [10:37:10] (03CR) 10Cathal Mooney: [C: 03+2] repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997) (owner: 10Cathal Mooney) [10:38:09] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:38:32] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:38:50] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond) > I think it's a perfectly valid idea, and I think it's relatively easy to do. We could just configure the... [10:38:52] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:39:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:04] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:39:12] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:40:10] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:41:11] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:42:11] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) > it can be done, but unless there's a huge compelling reason to decouple them, it seems too much work to me... 
[10:42:35] !log repool esams after successful move of cr3-knams to new rack T337997 [10:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:43:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:44:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:44] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] cert-manager: convert use of seed_image to image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935696 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [10:48:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [10:49:08] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10jbond) We should retest this when everything is on puppet7 
[10:49:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:49:23] this is me --^ [10:49:31] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet [10:49:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:50:47] (03PS1) 10Giuseppe Lavagetto: Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) [10:54:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:54:38] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:59] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:55:19] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:55:35] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:55:45] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:55:50] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:56:47] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:56:48] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:55] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[10:58:33] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:03:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet [11:05:08] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet [11:07:05] (03PS1) 10Elukey: knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 [11:10:08] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1071.eqiad.wmnet with OS bullseye [11:15:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host thanos-be2003.codfw.wmnet [11:15:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet [11:16:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:20:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [11:20:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1074.eqiad.wmnet with OS bullseye [11:22:13] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad [11:22:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad [11:22:32] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad [11:22:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad [11:24:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet [11:24:56] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:24:56] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [11:25:23] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:25:24] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [11:26:33] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - 
https://phabricator.wikimedia.org/T334594 (10ayounsi) `name=SONiC refresh needed verbose ayounsi@cumin1001:~$ sudo cookbook -v sre.network.tls lsw1-e8-eqiad START - Cookbook sre.network.tls for network device lsw1-e8-eqiad... [11:27:07] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad [11:27:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad [11:27:13] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:27:13] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [11:30:25] (03PS1) 10Jbond: WIP: Add check to look for violating hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/939260 (https://phabricator.wikimedia.org/T181971) [11:30:57] (03CR) 10CI reject: [V: 04-1] WIP: Add check to look for violating hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/939260 (https://phabricator.wikimedia.org/T181971) (owner: 10Jbond) [11:31:31] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10jbond) [11:34:54] 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Updated in [[ https://horizon.wikimedia.org/project/instances/47f8bf1e-31bb-48a9-a8ad-c116e0ab6112/ | deployment-docker-pr... 
[11:37:42] (03PS1) 10Ayounsi: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) [11:39:18] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:39:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad [11:39:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:39:37] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:39:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad [11:41:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "out of curiosity: why no setting for the ml-{eqiad,codfw} clusters?" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:41:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:44:29] (03CR) 10JMeybohm: [V: 03+1] deployment_server: Add certmanager defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:46:53] (03PS3) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) [11:46:56] (03PS2) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) [11:47:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:48:26] 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10DAlangi_WMF) Yes, that's right. And I just tested on beta, it's working as expected. Thanks! [11:48:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42520/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:49:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall ok if hackish, please see the comments inline." [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:50:07] (03CR) 10Klausman: [C: 03+1] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey) [11:50:54] !log jbond@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:50:54] !log jbond@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [11:51:43] (03PS2) 10Ayounsi: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) [11:55:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1015.eqiad.wmnet [11:56:12] (03PS4) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) [11:56:14] (03PS3) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 
(https://phabricator.wikimedia.org/T300033) [11:58:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42521/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:01:18] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [12:01:25] (03CR) 10JMeybohm: [V: 03+1] kubernetes::master: Publish service-account cert to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:02:29] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:03:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [12:04:02] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [12:04:04] 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10aborrero) a:05aborrero→03Jclark-ctr [12:04:26] (03CR) 10JMeybohm: [C: 03+2] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:04:36] (03PS4) 10JMeybohm: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) [12:04:47] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
analytics1074.eqiad.wmnet with reason: host reimage [12:05:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [12:05:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1015.eqiad.wmnet [12:05:31] (03CR) 10JMeybohm: [C: 03+2] Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:06:00] (03Merged) 10jenkins-bot: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:06:23] (03PS8) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) [12:07:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1074.eqiad.wmnet with reason: host reimage [12:08:51] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:09:29] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:09:43] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:13:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:14:37] I'm taking a look ^ [12:14:52] (03PS1) 10Daimona Eaytoy: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) [12:16:03] sigh that's basically a sync on k8s ml causing the lag [12:16:15] (03PS2) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 [12:17:08] (03CR) 10JMeybohm: [C: 03+2] deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:17:11] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:17:14] (03PS2) 10D3r1ck01: chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 [12:17:49] (03CR) 10JMeybohm: [C: 03+2] deployment_server: Bump default envoy image version to 1.23.10-2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:18:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:19:03] (03CR) 10CI reject: [V: 04-1] Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [12:19:08] (03CR) 10Hashar: Recognize ~/.config/docker-pkg.yaml (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 
(owner: 10Hashar) [12:21:09] (03PS1) 10Daimona Eaytoy: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) [12:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:36] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:23:53] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:24:53] (03PS1) 10Jforrester: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099) [12:25:04] (03PS1) 10Jforrester: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099) [12:25:59] (03PS1) 10Jgiannelos: mobileapps: Add core parsoid HTML support config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939292 (https://phabricator.wikimedia.org/T339865) [12:28:17] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[12:30:12] (03PS1) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 [12:30:17] (03Abandoned) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:30:25] (03PS9) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [12:31:26] (03PS10) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [12:32:15] (03PS2) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 [12:32:23] (03CR) 10Jgiannelos: "This patch adds config to allow choosing which page HTML endpoint to use. Enables core page HTML on staging for starters." [deployment-charts] - 10https://gerrit.wikimedia.org/r/939292 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:34:44] (03PS11) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [12:37:08] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:37:46] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:37:59] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[12:39:59] (03PS12) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [12:40:23] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye [12:41:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:42:54] !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host analytics1073.eqiad.wmnet [12:44:16] (03PS1) 10Marostegui: install_server: Reimage db2188-db2195 [puppet] - 10https://gerrit.wikimedia.org/r/939295 (https://phabricator.wikimedia.org/T341273) [12:44:46] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2188-db2195 [puppet] - 10https://gerrit.wikimedia.org/r/939295 (https://phabricator.wikimedia.org/T341273) (owner: 10Marostegui) [12:45:13] (03CR) 10Klausman: [C: 03+1] httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 (owner: 10Ilias Sarantopoulos) [12:46:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:53:06] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) >>! In T320390#9022808, @jbond wrote: > > > The [[ https://docs.gitlab.com/ee/integration/omniauth.html#link-existing-use... [12:54:40] (03CR) 10Jgiannelos: "For future reference, you don't need to bump the chart version for non chart specific changes." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01) [12:54:49] (03CR) 10Jgiannelos: [C: 03+1] chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01) [12:55:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1074.eqiad.wmnet with OS bullseye [12:57:21] (03PS2) 10Jbond: config-master: drop ssh-fingerprints.txt file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) [12:57:23] (03PS9) 10Jbond: ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1300). [13:00:04] James_F and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1300) [13:00:04] xSavitar: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:23] o/ [13:00:33] I’m in a meeting, sorry [13:00:42] (03PS10) 10Jbond: ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [13:00:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[13:01:18] (03CR) 10Jbond: "Ready for review, i have squashed the changes and updated how we publish known_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [13:02:09] I'm here. [13:02:20] * TheresNoTime can't deploy today, sorry [13:02:31] I can do it. [13:02:35] (03CR) 10D3r1ck01: [C: 03+2] chromium-render: Deploy latest proton build (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01) [13:02:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) (owner: 10Jforrester) [13:03:10] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2022/2023-Q4): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) 05Open→03In progress p:05Triage→03High [13:03:16] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [13:03:22] (03Merged) 10jenkins-bot: chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01) [13:04:14] (03Merged) 10jenkins-bot: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) (owner: 10Jforrester) [13:04:31] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] [13:04:36] !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [13:04:40] T147219: Wikipedia logo in Notification popup is not high-density ready - 
https://phabricator.wikimedia.org/T147219 [13:04:40] Daimona: Did you want to do yours yourself? [13:04:57] I'm not a deployer :) [13:05:05] Oh, really? We should fix that. :-) [13:05:42] xSavitar: BTW, we should really move the Content Transform service window to not clash… [13:05:58] Working as intended for now :P [13:06:02] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:12] !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:06:36] xSavitar, maybe! I'll ask Tyler or RelEng. But it seems it's okay to do both concurrently? [13:06:55] The whole point of the calendar is that we shouldn't ever have concurrent windows. :-) [13:07:02] !log derick@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:07:09] It doesn't break things too often, but… [13:07:16] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:07:19] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:07:30] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:07:46] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:07:52] (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:08:38] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade 
firmware for hosts ['analytics1073.mgmt.eqiad.wmnet'] [13:08:46] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet'] [13:08:48] James_F, got it. Should we poke Rel-Eng? I can file a task if you don't mind. [13:09:18] !log derick@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:09:36] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) [13:09:40] xSavitar: The trick is finding a time that works for you – would an hour earlier or later work? [13:09:52] xSavitar: If so, I can just write a patch moving the window now. [13:10:27] (03Merged) 10jenkins-bot: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:10:58] !log derick@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [13:12:18] James_F, an hour earlier works for me but I'm not the entire CT team. But I can bring this up in the CT team channel and ask for opinions about moving the window up by 1 hour. [13:12:26] Thanks! 
[13:12:27] !log derick@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:13:12] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] (duration: 08m 40s) [13:13:15] T147219: Wikipedia logo in Notification popup is not high-density ready - https://phabricator.wikimedia.org/T147219 [13:13:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:13:42] (03PS7) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [13:13:46] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:14:00] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:14:12] (03Abandoned) 10Jbond: config-master: drop ssh-fingerprints.txt file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [13:14:58] (03CR) 10CI reject: [V: 04-1] Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:15:49] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:15:55] * James_F sighs. [13:15:58] * xSavitar done with deployment. Service still works as expected. 
[13:16:04] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [13:17:02] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] [13:17:08] (03PS2) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) [13:17:08] !log jforrester@deploy1002 sync-world aborted: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] (duration: 00m 06s) [13:17:22] * James_F sighs at network issues during a deploy. Sorry all. [13:17:23] (03CR) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [13:17:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:17:56] (03CR) 10Jforrester: [C: 03+2] "Let's just merge this manually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:18:06] (03Merged) 10jenkins-bot: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:18:11] Finally. 
[13:18:27] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]] [13:18:31] T275945: Launch Wikifunctions - https://phabricator.wikimedia.org/T275945 [13:20:02] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:29] !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided) [13:20:33] !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 03s) [13:20:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [13:21:30] Daimona: Do you also need me to add a key on Beta Cluster, or are you doing that // it's not installed there? [13:21:43] It was done for beta a couple weeks ago [13:21:49] (03CR) 10Ayounsi: [C: 03+2] Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [13:21:50] Aha, excellent. [13:22:03] (03CR) 10Elukey: [C: 03+2] httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 (owner: 10Ilias Sarantopoulos) [13:22:10] (T320258) [13:22:11] T320258: Dashboard integration: Configure the P&E Dashboard integration in beta - https://phabricator.wikimedia.org/T320258 [13:22:32] * James_F nods. 
[13:24:11] (03PS1) 10Ladsgroup: spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302 [13:24:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [13:24:47] (03Merged) 10jenkins-bot: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [13:25:26] (03PS2) 10Ladsgroup: spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302 [13:25:42] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/939302 (owner: 10Ladsgroup) [13:26:19] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]] (duration: 07m 52s) [13:26:23] T275945: Launch Wikifunctions - https://phabricator.wikimedia.org/T275945 [13:26:55] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host analytics1073.eqiad.wmnet [13:28:21] (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey) [13:29:14] (03CR) 10Elukey: [C: 03+2] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey) [13:30:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/938821 (owner: 10Volans) [13:30:22] * James_F twiddles thumbs. [13:30:31] Daimona: Sorry each scap takes so long. [13:30:45] That's fine :) [13:31:53] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9019625, @Vgutierrez wrote: > @ayounsi @cmooney could you let DCops know which racks would be better for these boxes? Thanks! I am on-site this week in eqiad. Can I get... 
[13:32:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42523/console" [puppet] - 10https://gerrit.wikimedia.org/r/937522 (https://phabricator.wikimedia.org/T341721) (owner: 10Jbond) [13:32:37] (03PS2) 10Jforrester: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:33:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:33:33] Daimona: You OK to test ^ when it lands on a canary? [13:33:51] Yup [13:33:59] (03Merged) 10jenkins-bot: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:34:06] Ace. 
[13:34:16] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]] [13:34:27] T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260 [13:35:17] !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided) [13:35:21] !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 03s) [13:35:45] !log jforrester@deploy1002 daimona and jforrester: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:37:36] Daimona: It's live to test. [13:38:10] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/937522 (https://phabricator.wikimedia.org/T341721) (owner: 10Jbond) [13:38:47] Thanks! [13:38:50] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303 [13:39:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:40:05] xSavitar: Made https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/35 to shift the window an hour earlier. [13:40:08] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:40:22] (03PS2) 10Jforrester: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:41:31] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10RobH) a:05Jclark-ctr→03Eevans @eevans, I'm on-site in EQIAD this week (through Friday). John has already designated where these will land: >>! In T307035#8565052, @... [13:41:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:42:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [13:42:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:42:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:43:13] (03PS4) 10Herron: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [13:43:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:43:24] (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [13:43:30] !log upload python3-conftool_2.2.2-1 [13:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:40] !log upload python3-conftool_2.2.2-1+deb12u1 [13:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:06] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [13:46:17] (03CR) 10Jbond: promethus: switch to using cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [13:47:05] Daimona: Still testing? OK to sync? [13:47:14] Still testing, almost done [13:47:21] Ack. [13:47:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:28] Testing done, looks great! [13:49:43] Cool. [13:50:30] (03CR) 10Herron: [C: 03+2] promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [13:50:39] And then finally just a doc-only one. [13:51:46] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) @RobH There is no real preference on my side. I would say pick one rack from E1/E2/E3/F1/F2/F3 and put the first 3 of them in that one, then place lvs1016 in a different rack from th... 
[13:52:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:33] James_F, thanks! Will share on CT Slack. I already asked there and whatever is decided, then we'll move forward. [13:54:18] xSavitar: Thank you! [13:55:08] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:23] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10RobH) [13:55:35] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]] (duration: 21m 19s) [13:55:39] T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260 [13:55:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:55:54] James_F, for nothing sir. Yiannis is fine with the change but he said if it's okay with EU tz friendly. So I'll give you feedback once everyone settles. :) [13:56:22] (03Merged) 10jenkins-bot: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy) [13:56:25] 10SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) [13:56:32] * James_F nods. 
[13:56:39] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]] [13:56:39] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1001.eqiad.wmnet [13:56:41] !log bking@cumin1001 START - Cookbook sre.dns.netbox [13:58:09] 10SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10Jdforrester-WMF) Looking at the SAL, possible fall-out from T337997? [13:58:21] !log jforrester@deploy1002 jforrester and daimona: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:00:08] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:09] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) Thanks @cmooney, @Fabfur will take care of running the decom cookbook (thanks!) [14:01:19] PROBLEM - Host lsw1-f2-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:01:20] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) We were unaware that these moves would require an IP change (and by extension/recommendation a reimage). There is more than 2TB of data (per host) that would have t... 
[14:02:37] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:03:07] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10Joe) [14:04:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1013.eqiad.wmnet [14:04:44] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]] (duration: 08m 04s) [14:04:47] T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260 [14:05:19] !log asw2-esams# set interfaces xe-4/0/4 disable - T342121 [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:23] T342121: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 [14:06:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:06:20] OK, finally, all done. 65 minutes for 5 patches. 
:-( [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:29] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:25] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) p:05Triage→03Medium [14:10:00] (03PS1) 10Ayounsi: Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306 [14:10:17] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001" [14:10:33] (03Abandoned) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [14:12:15] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [14:16:27] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001" [14:16:27] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:27] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors 
[14:16:30] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors [14:16:33] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:55] (03PS3) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 [14:18:30] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001" [14:19:15] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond) [14:19:15] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001" [14:19:15] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:15] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors [14:19:18] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors [14:19:25] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1001.eqiad.wmnet [14:19:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond) 05Open→03In progress p:05Triage→03Medium [14:19:35] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet 
(Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [14:21:03] (03CR) 10Jbond: [C: 03+1] Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306 (owner: 10Ayounsi) [14:21:27] (03CR) 10Ayounsi: [C: 03+2] Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306 (owner: 10Ayounsi) [14:21:34] (03PS1) 10Jelto: gitlab: auto link existing users with OIDC [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) [14:22:09] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [14:22:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42524/console" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [14:22:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [14:22:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:22:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1013.eqiad.wmnet [14:23:15] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1013.eqiad.wmnet` - lvs1013.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... 
[14:23:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:23:58] (03CR) 10CI reject: [V: 04-1] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:24:37] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42525/console" [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [14:24:46] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) lvs1013.eqiad.wmnet has been decommissioned via cookbook @Tue 18 Jul 2023 02:24:10 PM UTC [14:25:05] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [14:25:14] (03CR) 10Hashar: Recognize ~/.config/docker-pkg.yaml (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [14:25:30] (03PS13) 10Giuseppe Lavagetto: api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:27:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:27:35] (03PS1) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) [14:28:13] (03Merged) 10jenkins-bot: api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 
(https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [14:29:33] (03PS2) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) [14:29:40] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1014.eqiad.wmnet [14:30:33] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:31:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:31:49] (03PS2) 10Ayounsi: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) [14:32:26] (03PS3) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) [14:33:09] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:33:19] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:33:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42528/console" [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [14:34:14] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:34:56] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [14:35:24] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:35:45] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:36:56] !log ayounsi@cumin1001 START - 
Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.3 - ayounsi@cumin1001 [14:36:58] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [14:37:06] (03CR) 10Eevans: "To be clear: The component/cassandra311 repository has been updated to 3.11.14, making this changeset a no-op." [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans) [14:37:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [14:37:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1014.eqiad.wmnet [14:38:02] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1014.eqiad.wmnet` - lvs1014.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... 
[14:38:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [14:38:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.3 - ayounsi@cumin1001 [14:38:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [14:39:07] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [14:39:34] (03CR) 10Ayounsi: [C: 03+2] Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [14:40:05] (03Merged) 10jenkins-bot: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [14:40:09] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [14:42:33] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10cmooney) @TheDJ thanks for reporting this, indeed it does not look right and was an oversight by myself after we re-pooled esams earlier today. We did some work earlier moving equipment in one of our... 
[14:44:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:23] !log dns2004 upgrade to pdns-rec 4.8.4: T341611 [14:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:26] T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 [14:49:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:13] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) 05Open→03Resolved a:03TheDJ Thank you, seems fixed now indeed. 
[14:53:50] PROBLEM - Host db1198 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:54:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1015.eqiad.wmnet [14:54:17] ^^^ [14:54:31] marostegui: is this known [14:54:47] !log robh@cumin1001 START - Cookbook sre.dns.netbox [14:54:56] nop [14:54:57] checking [14:55:02] I'll ack the page [14:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1198', diff saved to https://phabricator.wikimedia.org/P49571 and previous config saved to /var/cache/conftool/dbconfig/20230718-145529-root.json [14:55:30] * jbond disconnects from mgmt port [14:55:31] Depooled [14:55:33] Thanks herron [14:55:35] (03PS1) 10JMeybohm: deployment_server: Fix structure for certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033) [14:55:42] I am going to create a task so we can follow up there [14:55:48] marostegui: ack thanks [14:55:55] sounds good marostegui [14:56:55] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013 relocation - robh@cumin1001" [14:57:14] (03PS1) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) [14:57:17] Thanks both [14:57:35] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42529/console" [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:57:50] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013 relocation - robh@cumin1001" [14:57:50] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:13] (03CR) 10JHathaway: install_server: drop Bashisms 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [14:58:40] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Fix structure for certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:58:41] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10ayounsi) [14:58:47] (03PS1) 10Marostegui: db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/939313 (https://phabricator.wikimedia.org/T342129) [15:00:17] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Memory issues: ` Record: 15 Date/Time: 01/19/2023 16:23:12 Source: system Severity: Critical Description: Multi-bit memory errors are detected on the memory device at location(s) D... [15:00:20] (03CR) 10Marostegui: [C: 03+2] db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/939313 (https://phabricator.wikimedia.org/T342129) (owner: 10Marostegui) [15:00:33] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs1013.mgmt.eqiad.wmnet with reboot policy FORCED [15:01:03] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [15:01:19] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Actually I realised those errors are old [15:01:59] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) There are no more errors on the idrac - @Jclark-ctr can you check it onsite? 
The host seems to be unreachable [15:02:19] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1015.eqiad.wmnet [15:02:30] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1015.eqiad.wmnet` - lvs1015.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... [15:03:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:38] RECOVERY - Host db1198 #page is UP: PING WARNING - Packet loss = 90%, RTA = 0.28 ms [15:03:39] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) a:03Jclark-ctr @Marostegui looking at it now [15:03:44] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Thanks! 
[15:03:51] PROBLEM - SSH on db1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:04:47] RECOVERY - SSH on db1198 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:05:36] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [15:07:16] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) 05Open→03Resolved @Marostegui Replaced cable link is back up now [15:08:14] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host lvs1013.mgmt.eqiad.wmnet with reboot policy FORCED [15:09:07] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Thanks John! I can reach the server now! It looks like it indeed didn't crash, the uptime is 70 days and MySQL is also up. Logs confirm it was a network issue: ` [Tue Jul 18 14:48:46 2023] tg3 0... [15:11:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [15:12:38] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans) [15:12:40] (03CR) 10BCornwall: [C: 03+2] roll-restart-wikimedia-dns: Add reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [15:13:01] 10ops-eqiad, 10DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) @Marostegui Sorry i did close task if you want to reopen it. i was able to duplicate losing link with cable. 
replaced sfp-t and cable [15:13:31] 10ops-eqiad, 10DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Don't worry, no need to reopen :) [15:13:41] (03CR) 10Herron: [C: 03+1] logstash: remove pybal log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:14:08] (03CR) 10Herron: [C: 03+1] logstash: remove k8s stats-exporter cloning [puppet] - 10https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:14:37] (03CR) 10Herron: [C: 03+1] logstash: remove grafana log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:15:17] (03CR) 10Herron: [C: 03+1] logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:16:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42530/console" [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:18:03] (03PS1) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) [15:18:12] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye [15:18:19] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye [15:18:54] (03CR) 10JHathaway: [C: 04-1] monitoring: fix bashisms and other minor lint 
issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [15:19:57] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) [15:20:24] (03CR) 10CI reject: [V: 04-1] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:20:45] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bullseye [15:20:50] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors: - lvs1013 (**FAIL**) - Removed from Pup... [15:21:11] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1013 [15:21:13] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1013 [15:22:12] (03CR) 10JHathaway: [C: 03+1] kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [15:22:52] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye [15:22:59] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye [15:23:22] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1016.eqiad.wmnet [15:23:36] (03PS13) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [15:23:38] 
(03PS1) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) [15:24:08] (03CR) 10CI reject: [V: 04-1] CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:24:10] (03CR) 10Jsn.sherman: beta: log additional click events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [15:24:12] (03CR) 10CI reject: [V: 04-1] Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:25:38] (03PS2) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) [15:26:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42532/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:28:30] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [15:30:18] (03PS1) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/939326 (https://phabricator.wikimedia.org/T301944) [15:31:03] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [15:31:24] (03PS2) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) [15:31:26] (03PS14) 10JMeybohm: Update ipoid to certmanager 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [15:31:36] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [15:31:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [15:31:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:31:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1016.eqiad.wmnet [15:32:02] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1016.eqiad.wmnet` - lvs1016.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... 
[15:33:53] (03PS3) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) [15:34:39] (03CR) 10JHathaway: ssh: switch to using the same file we use in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [15:34:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42533/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:36:52] (03PS4) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) [15:37:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42535/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:39:04] (03PS5) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) [15:39:55] (03CR) 10Eevans: [C: 03+2] cassandra: transition 3.11.14 from 'dev' to '3.x' [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans) [15:40:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42536/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:41:05] (03CR) 10Giuseppe Lavagetto: noc: add script to dump etcd db config (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [15:41:24] (03PS5) 10Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) [15:41:26] (03PS5) 10Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) [15:42:21] 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul) [15:43:26] (03PS3) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) [15:43:28] (03PS15) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [15:43:30] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246) [15:43:32] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [15:44:14] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [15:44:40] !log dancy@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.18 refs T340246 [15:44:43] T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246 [15:46:24] 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul) Onsite work complete on asw-b1-codfw [15:46:29] (03PS4) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [15:48:32] (03CR) 10JMeybohm: ".fixtures/service_proxy.yaml can actually be removed now, but I had issues doing so because the git revert code obviously recreates it as " 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:48:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:11] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [15:54:09] (03CR) 10JMeybohm: [C: 03+1] noc: add script to dump etcd db config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [15:55:11] (03PS5) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) [15:55:26] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:59:41] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:43] (03CR) 10JHathaway: install_server: drop Bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [16:00:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1600). 
[16:00:05] dancy and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] o/ [16:00:10] !log robh@cumin1001 START - Cookbook sre.dns.netbox [16:01:00] 10ops-eqiad: analytics1073 loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) [16:01:29] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bullseye [16:01:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:34] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors: - lvs1013 (**FAIL**) - Removed from Pup... 
[16:01:54] 10ops-eqiad, 10Infrastructure-Foundations: analytics1073 loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) [16:02:32] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs10145 relocation - robh@cumin1001" [16:02:45] (03CR) 10Kaleem Bhatti: "merge please" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [16:02:57] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) [16:03:10] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1014 [16:03:16] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs10145 relocation - robh@cumin1001" [16:03:16] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:03:17] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1014 [16:03:21] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1015 [16:03:27] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1015 [16:04:17] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) [16:04:46] (03PS1) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) [16:05:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:13] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if 
decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10bking) Was thinking a bit more about this...would it work to do some minimal sanity-checking on the DNS changes (such as t... [16:05:19] (03CR) 10CI reject: [V: 04-1] puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:05:52] jbond: Are you handling the puppet window today? [16:07:00] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) i think this will ultimately be solved by adding locking support to cookbooks, see T341973 [16:08:07] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) > It looks like there is work in progress to add locking to cookbooks , which would be an acceptable workaround. ind... [16:08:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) [16:08:30] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10jbond) [16:08:59] dancy: sorry I had an interview run long, looking now! [16:09:04] Thanks! [16:09:50] any preferred order, or do em both at once? 
[16:10:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:16] 938931 first, then a quick test, then 938939 [16:10:23] 👍 [16:10:37] (03PS2) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) [16:10:38] on gitlab-runner1003, yeah? [16:10:56] (03CR) 10RLazarus: [C: 03+2] Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:11:04] 938931 applies to all of the WMCS and trusted runners (there are several) [16:11:25] Likewise for 938939 [16:11:43] sure, I can run puppet on all of them if you like [16:11:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42538/console" [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:11:52] Yes please. [16:12:20] is this the right set? 
gitlab-runner[2002-2004].codfw.wmnet,gitlab-runner[1002-1004].eqiad.wmnet [16:12:30] I think you're on your own for the WMCS ones [16:12:58] (03PS1) 10DCausse: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939327 [16:13:02] (03PS2) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) [16:13:14] (03PS3) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) [16:13:53] (03PS1) 10DCausse: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939328 [16:13:54] rzl: yes that's the right set for the trusted runners [16:14:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42539/console" [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:14:21] I can wait for the regular puppet runs for the WMCS runners. [16:15:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:15:32] done, except puppet failed on gitlab-runner2003.codfw.wmnet, having a look [16:16:09] oh, it just didn't get the lock because a regular run was already in progress [16:16:11] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:14] cool 👍 go ahead and test [16:16:26] ok.. in progress. 
[16:18:30] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Logstash: implement availability SLO (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: 10Cwhite) [16:19:41] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:44] (03PS1) 10Jbond: puppetserver: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/939324 [16:19:50] dancy: ^ fyi [16:20:32] hmmm [16:21:11] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:22:39] (03PS3) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) [16:23:11] rzl: Looks like that was a one-off glitch (a problem communicating with dockerd?). Subsequent runs of the service seem to be ok. [16:23:52] good enough for me [16:24:00] should I go ahead with the next patch? [16:24:17] Yes. First phase of testing passed. Ready for the next. 
[16:24:26] (03CR) 10RLazarus: [C: 03+2] Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:24:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye [16:25:23] puppet's running now [16:26:13] and done, over to you [16:28:22] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) 05Open→03Resolved We have updated the SLI to an availability. Changes are applied to the dash... [16:28:38] rzl: Second test completed. Thanks for deploying! [16:28:40] !log maintenance finished for kafka main-codfw [16:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:48] rad, thanks! [16:29:52] (03PS1) 10Ayounsi: Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082) [16:30:42] (03CR) 10Ayounsi: [C: 03+2] Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [16:30:55] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.18 refs T340246 (duration: 46m 15s) [16:31:01] T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246 [16:31:37] (03Merged) 10jenkins-bot: Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [16:33:09] !log dancy@deploy1002 Pruned MediaWiki: 1.41.0-wmf.16 (duration: 02m 11s) [16:34:10] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - 
https://phabricator.wikimedia.org/T342146 (10WDoranWMF) [16:34:49] (03PS2) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) [16:36:35] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10WDoranWMF) I'm marking this as high because Thomas will need the access in order to be able to start supporting Ops work starting week... [16:36:41] (03PS3) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) [16:36:47] (03PS1) 10Btullis: Upgrade the analytics instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/939347 (https://phabricator.wikimedia.org/T336286) [16:37:05] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:39:06] (03CR) 10CI reject: [V: 04-1] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:39:09] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) p:05Triage→03High [16:39:41] (03CR) 10CI reject: [V: 04-1] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:39:57] (03PS4) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) [16:42:42] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 
10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10JAllemandou) Indeed, being part of Data Engineering team, Thomas will be in charge during his ops-week time to restart jobs as the `ana... [16:44:24] (03PS5) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) [16:44:30] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) Happily, I can also approve this change, as per: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933976 I'll merge this a... [16:44:35] (03CR) 10Btullis: [C: 03+2] Update approvers for analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/933976 (owner: 10Ottomata) [16:45:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42543/console" [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:45:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:46:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:48:42] (03PS1) 10Btullis: Add tchin to the analytics-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146) [16:49:02] (03PS1) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) [16:50:40] 10SRE, 10SRE-Access-Requests, 10Data Engineering and 
Event Platform Team, 10Data-Platform-SRE, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) p:05Triage→03High I have created: https://gerrit.wikimedia.org/r/939348 [16:56:52] (03CR) 10Jdlrobson: "Jan: per standup can you run tests locally and compare results with Mo on the ticket. Thanks in advance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1700) [17:01:39] !log dancy@deploy1002 Installing scap version "4.55.0" for 605 hosts [17:02:35] !log dancy@deploy1002 Installation of scap version "4.55.0" completed for 605 hosts [17:04:02] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough-drmrs and A:wikidough [17:07:20] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [17:07:34] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:52] ^ expected [17:07:56] (03PS6) 10Jdlrobson: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) [17:08:14] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:08:52] (03PS1) 10Jdlrobson: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) [17:09:01] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 
(10lmata) [17:09:02] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:42] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:12] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10lmata) p:05Triage→03Medium [17:10:20] (03CR) 10Jdlrobson: "FYI. I'll backport this today. Intentionally doing this on English Wikipedia but not wmf18 so we get a couple of days of data since the er" [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [17:10:37] (03PS1) 10Jdlrobson: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) [17:13:18] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) a:03BTullis [17:14:45] 10SRE, 10Observability-Alerting: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101 (10lmata) Understood, we will make a note of it in our backlog and carefully evaluate it when the opportunity arises [17:14:56] (03PS1) 10Jdlrobson: Fixes: Mobile login watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) [17:16:22] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:16:25] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1002.eqiad.wmnet [17:16:26] !log bking@cumin1001 START - Cookbook sre.dns.netbox [17:16:36] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:06] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:19:09] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [17:19:20] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough-drmrs and A:wikidough [17:19:55] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [17:19:55] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:19:55] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors [17:19:58] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors [17:20:23] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [17:20:36] (03CR) 10Ssingh: "So we were testing this today and observed that when the host reboots, Puppet is still disabled. 
And because it is disabled, it won't find" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall) [17:20:50] (03PS1) 10Jelto: gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) [17:21:06] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [17:25:18] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1002.eqiad.wmnet with OS bookworm [17:25:25] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm [17:27:05] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:28:46] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata) p:05Triage→03Medium [17:29:01] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016 relocation - robh@cumin1001" [17:29:08] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata) [17:29:46] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016 relocation - robh@cumin1001" [17:29:47] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:23] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: 
Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata) Since this is part of core work would this be better fitted as "high" priority? [17:30:28] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1016 [17:30:36] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1016 [17:30:58] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) [17:33:58] (03CR) 10Jelto: "I found two more references here:" [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [17:40:10] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42544/console" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [17:42:23] (03CR) 10Andrea Denisse: [C: 03+1] Add tchin to the analytics-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146) (owner: 10Btullis) [17:45:06] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1075.eqiad.wmnet with OS bullseye [17:46:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [17:47:49] (03CR) 10KaleemBot: [C: 03+1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [17:52:56] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm. Do you have a ldap_group_sync_bot_token I can put into private puppet? 
I can also create a token myself if you have a bot-user/proje" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [17:56:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:57:04] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10andrea.denisse) Hi @Mpossoupe , could you please fill the details of your request using the [[ https://phabricator.wikimedia.org/project/profile/1564/ | LDAP-Access-Requests templates ]] and tag your manage... [17:57:11] (03CR) 10Ahmon Dancy: Run LDAP group sync periodically on gitlab replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy) [17:57:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:57:25] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10andrea.denisse) a:03andrea.denisse [17:58:08] (03CR) 10Jelto: [C: 03+1] "lgtm, will merge this on Thursday if that's fine for you Antoine" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [18:00:04] dancy and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1800). 
[18:01:56] 🚂 [18:02:11] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246) [18:02:16] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:03:10] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:03:36] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto) [18:03:46] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) Hi @andrea.denisse , Noted. Will do and let you know. Thanks [18:09:23] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:03] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.18 refs T340246 [18:10:07] T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246 [18:16:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1002.eqiad.wmnet with OS bookworm [18:16:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1002.eqiad.wmnet [18:16:32] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm executed w... [18:28:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [18:28:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) p:05Triage→03Medium [18:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:38:27] (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [18:40:42] (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [18:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:43:30] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [18:51:01] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [18:51:18] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 17s) [18:54:41] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [18:54:46] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [18:57:20] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 
[18:57:39] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 18s) [19:08:12] 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) [19:10:35] 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) a:03ssingh Please note parent task 341588 has the range of cp1[090-105] however, cp1090 is already live/in use. Additionally, we have 4 cp hosts from eqsin to use for CP in eqiad (so c... [19:10:44] 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) [19:10:59] 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) [19:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:21:55] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10RobH) [19:22:01] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10RobH) [19:34:31] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10RobH) [19:34:47] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - 
https://phabricator.wikimedia.org/T342163 (10RobH) [19:37:31] (03PS7) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [19:37:33] (03PS6) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [19:38:13] (03CR) 10CI reject: [V: 04-1] [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [19:39:58] (03PS7) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [19:45:14] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10RobH) [19:45:26] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10RobH) [19:49:11] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet'] [19:49:13] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet'] [19:49:42] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [19:50:15] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [19:52:48] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet'] [19:53:06] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts 
['analytics1073.eqiad.wmnet'] [19:53:33] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1075.eqiad.wmnet'] [19:53:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1075.eqiad.wmnet'] [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T2000). Please do the needful. [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:45] i can deploy today [20:00:47] hi Jdlrobson [20:00:59] here [20:01:46] (03CR) 10Urbanecm: [C: 03+2] Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:02:31] Jdlrobson: your comment on the Popups patch says "doing this on English Wikipedia but not wmf18", but a wmf.18 patch is in the calendar as well. is that intentional? [20:02:46] yes that's intentional [20:02:58] (03CR) 10Urbanecm: [C: 03+2] Fixes: Mobile login watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson) [20:03:19] okay, so +2'ing the other patch too then. 
[20:03:22] (03CR) 10Urbanecm: [C: 03+2] Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:03:57] (03PS1) 10Jbond: puppetserver: use FQDN in metric [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125) [20:04:07] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10RobH) [20:04:18] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10RobH) [20:04:29] (03PS7) 10Urbanecm: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:04:35] (03CR) 10Urbanecm: [C: 03+2] Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:05:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42545/console" [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [20:05:51] (03Merged) 10jenkins-bot: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:06:24] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]] [20:06:32] T341243: Design: Get icons for remaining Wiktionary, Wikiversity, Wikibooks projects - https://phabricator.wikimedia.org/T341243 [20:06:32] T341912: Update knwikisource logos - https://phabricator.wikimedia.org/T341912 [20:06:32] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:06:46] (03Merged) 10jenkins-bot: Add additional debugging 
closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:07:52] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:08:09] Jdlrobson: your config patch is at mwdebug1001, can you test? :) [20:08:21] yep [20:10:22] urbanecm: LGTM [20:10:30] proceeding [20:16:14] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]] (duration: 09m 50s) [20:16:20] and done [20:16:21] T341243: Design: Get icons for remaining Wiktionary, Wikiversity, Wikibooks projects - https://phabricator.wikimedia.org/T341243 [20:16:21] T341912: Update knwikisource logos - https://phabricator.wikimedia.org/T341912 [20:16:22] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:17:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:17:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson) [20:17:19] (03Merged) 10jenkins-bot: Fixes: Mobile login 
watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson) [20:17:23] (03Merged) 10jenkins-bot: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:17:54] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]] [20:17:59] T341812: Mobile login watermark large and uncentered - https://phabricator.wikimedia.org/T341812 [20:17:59] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [20:19:28] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:19:45] Jdlrobson: all three backports are at mwdebug1001, can you test them now please? 
[20:20:01] looking :) [20:21:16] all of them LGTM [20:21:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:36] great, proceeding [20:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:26:53] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1002.eqiad.wmnet [20:26:54] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:28:21] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10andrea.denisse) Hi! I see this task in the SRE Clinic Duty Triage, feel free to let me know if you would like me to help with creating the VMs. 
:) [20:28:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]] (duration: 10m 28s) [20:28:27] T341812: Mobile login watermark large and uncentered - https://phabricator.wikimedia.org/T341812 [20:28:27] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [20:28:27] Jdlrobson: and deployed [20:28:34] anything else? [20:28:43] thanks urbanecm I'll keep an eye on logstash. Should need 10 mins to double check everything is good [20:29:06] sounds good. feel free to ping me should a revert become necessary. [20:29:07] (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [20:30:04] urbanecm: looks like it's working to me [20:30:07] awesome [20:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:47] urbanecm: hm.. it does look a lot higher than expected though [20:30:56] higher? [20:31:22] urbanecm: yeh i think we might need to follow up or revert. [20:31:35] hang on [20:31:38] waiting [20:32:06] yeh :( [20:32:08] at least for enwiki for now [20:32:16] I can follow up on the deployment branch [20:33:23] I'd prefer a revert, unless the fix is very easy -- B&C is usually not intended for code review. 
[20:33:41] I've got a follow-up ready [20:33:47] the code's in the wrong place [20:33:51] it should be moved down [20:33:58] so a trivial bug on my part :/ [20:34:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:30] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/939364 [20:35:19] ^ urbanecm how do you feel about backporting that? [20:35:25] i can get it merged to master today [20:35:29] just might take longer than 30 mins [20:35:51] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/939364/2/src/index.js is the significant part of the change [20:36:03] Jdlrobson: let's try backporting [20:36:13] we can try enwiki first [20:36:21] 1.41.0-wmf.17 [20:36:29] if that works then we'll put it on 1.41.0-wmf.18 [20:36:41] (03PS1) 10Jdlrobson: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) [20:36:57] sounds good [20:37:04] (03CR) 10Urbanecm: [C: 03+2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:39:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:39:35] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [20:40:38] (03CR) 10CI reject: [V: 04-1] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 
(https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:40:49] Jdlrobson: ^^^ [20:41:07] i guess let's restart, as it's selenium? [20:41:12] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [20:42:22] main passed, restarting... [20:42:23] (03CR) 10Urbanecm: [C: 03+2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:42:57] (03PS1) 10Jdlrobson: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) [20:43:09] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [20:43:10] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:43:10] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors [20:43:13] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors [20:43:15] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:46:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:08] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10RobH) [20:47:37] (03Merged) 10jenkins-bot: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson) [20:47:55] 
ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (RobH) [20:48:00] !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host analytics1073.eqiad.wmnet [20:50:53] SRE-OnFire, Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (lmata) [20:52:53] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]] [20:52:57] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [20:54:10] SRE, Observability-Metrics: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (lmata) [20:54:24] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:54:38] Jdlrobson: finally on mwdebug. can you test? [20:55:38] if not urbanecm yep [20:56:33] https://en.wikipedia.org/wiki/Main_Page looking good.. 
[20:57:14] so, let's proceed then [20:59:38] (CR) Btullis: [C: +2] Add tchin to the analytics-admins POSIX group [puppet] - https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146) (owner: Btullis) [21:00:35] (CR) AOkoth: [C: +2] vrts: add test VM to site [puppet] - https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [21:02:52] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [21:02:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]] (duration: 10m 01s) [21:02:58] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [21:03:28] so, deployed. [21:03:31] let's do .18 then? [21:03:34] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001" [21:03:34] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:03:34] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors [21:03:37] urbanecm: just verifying the volume goes down [21:03:37] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors [21:03:42] sure [21:03:44] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1002.eqiad.wmnet [21:03:54] group 0 / 1 is not a problem for volume as I don't think any projects run page previews [21:04:07] ack [21:04:16] group 0 sorry [21:04:21] hewiki and cawiki run it and are in group 1 [21:04:52] so far so good... 
https://usercontent.irccloud-cdn.com/file/KDQ2brpU/Screenshot%202023-07-18%20at%202.04.41%20PM.png [21:05:07] SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Platform-SRE, Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (BTullis) @tchin that is now done. Welcome to the `analytics-admins` group. [21:05:47] 👍 [21:06:51] ok yep this looks good [21:06:54] we can backport the other one [21:07:01] i'm seeing the tail :) [21:08:23] great, let's go for it [21:08:49] (CR) Urbanecm: [C: +2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: Jdlrobson) [21:08:56] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: Jdlrobson) [21:10:50] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [21:10:51] !log bking@cumin1001 START - Cookbook sre.dns.netbox [21:13:18] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:13:57] (Merged) jenkins-bot: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: Jdlrobson) [21:14:03] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:14:03] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:14:04] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache 
flink-zk1003.eqiad.wmnet on all recursors [21:14:07] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [21:14:25] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]] [21:14:29] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [21:14:33] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:15:18] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:15:27] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [21:15:34] SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [21:15:37] ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install X - https://phabricator.wikimedia.org/T342176 (RobH) [21:15:55] ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (RobH) [21:15:56] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:16:29] proceeding, additional testing seems unnecessary [21:16:38] ops-eqiad, DC-Ops, 
Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (RobH) [21:19:43] urbanecm: thanks [21:22:08] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]] (duration: 07m 42s) [21:22:12] T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081 [21:22:21] Jdlrobson: and should be all done [21:22:28] anything else i can help with today? [21:28:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host analytics1073.eqiad.wmnet [21:28:13] thanks urbanecm so sorry this overran [21:28:21] no worries [21:28:47] ops-eqiad, DC-Ops, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (RobH) [21:29:17] (CR) Cwhite: [C: +2] logstash: restore program field to node logs [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [21:32:30] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:36:08] SRE, ops-eqiad, Data-Platform-SRE, Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) I've upgraded: * the iDRAC version * the NIC firmware * the BIOS I tried two versions of the NIC firmware, in case it was t... 
[21:46:41] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342071 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue [21:51:38] (CR) Subramanya Sastry: Set default for UseLegacyMediaStyles and disable on officewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: Arlolra) [21:52:30] (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:59:24] (PS1) Subramanya Sastry: Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) [22:00:05] (CR) C. Scott Ananian: Set default for UseLegacyMediaStyles and disable on officewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: Arlolra) [22:00:55] (CR) Subramanya Sastry: "To be merged as part of the backport window tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: Subramanya Sastry) [22:01:09] (CR) C. Scott Ananian: [C: +1] "agreed w/ diagnosis." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: Subramanya Sastry) [22:06:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm [22:06:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [22:07:00] SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w... [22:07:21] SRE, Bitu, Infrastructure-Foundations, MediaWiki-extensions-LdapAuthentication, and 2 others: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (Pppery) [22:09:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:04] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [22:12:05] !log bking@cumin1001 START - Cookbook sre.dns.netbox [22:16:50] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [22:18:33] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [22:18:34] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:18:34] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache 
flink-zk1003.eqiad.wmnet on all recursors [22:18:37] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [22:18:39] !log bking@cumin1001 START - Cookbook sre.dns.netbox [22:23:25] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [22:24:10] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [22:24:10] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:24:10] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors [22:24:13] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [22:24:20] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [22:28:15] (CR) Cwhite: [C: +2] logstash: remove pybal log cloning [puppet] - https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [22:31:43] (CR) BCornwall: [C: +2] "After-the-fact bug report created:" [cookbooks] - https://gerrit.wikimedia.org/r/937173 (owner: BCornwall) [22:32:29] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [22:32:30] !log bking@cumin1001 START - Cookbook sre.dns.netbox [22:34:21] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:34:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [22:44:28] !log brett@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on P{doh5002*} and A:wikidough [22:44:40] (CR) 
Mabualruz: [C: +1] Turn off A/B Test in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: Kimberly Sarabia) [22:46:31] SRE, ops-eqiad, Data-Platform-SRE, Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (Papaul) @BTullis we had the same issue with sessionstore2001 in codfw see task below what we did was to replace the 1G RJ45/SFP convert... [22:48:35] SRE, Icinga, Observability-Alerting, observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (lmata) [22:49:20] SRE, Observability-Logging: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (lmata) [22:50:35] SRE, Infrastructure-Foundations, Observability-Alerting, netops, and 2 others: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350 (lmata) [22:50:51] SRE, Infrastructure-Foundations, Observability-Metrics, netops, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (lmata) [22:51:05] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on P{doh5002*} and A:wikidough [22:51:57] SRE, Observability-Alerting: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (lmata) [22:52:05] (CR) BCornwall: [V: +1] "This was run successfully on doh5002: Adding disable_puppet_on_reboot = True behaved as expected." [cookbooks] - https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall) [22:52:15] (CR) BCornwall: [V: +1] "This was run successfully on doh5002: Adding disable_puppet_on_reboot = True behaved as expected." 
[cookbooks] - https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall) [22:52:43] SRE, Bitu, Infrastructure-Foundations, MediaWiki-extensions-LdapAuthentication, and 2 others: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (taavi) Open→Resolved a:taavi Verified my fix works on labtestwikitech, it'll roll out to wikitech... [22:56:56] (PS3) Mabualruz: Turn off A/B Test in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: Kimberly Sarabia) [23:11:00] SRE-OnFire, Incident Tooling: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (lmata) [23:16:12] SRE-tools, Infrastructure-Foundations, Observability-Alerting, observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (lmata) [23:19:13] SRE, Observability-Alerting, observability: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (lmata) [23:19:57] SRE, Infrastructure-Foundations, Observability-Metrics, netops, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (lmata) [23:22:02] (CR) Jdlrobson: "Kim: do you need help backporting this change?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: Kimberly Sarabia) [23:22:57] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (lmata) [23:24:44] SRE, Traffic, Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (lmata) [23:25:34] SRE, Incident Tooling: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (lmata)