[00:03:16] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:20] PROBLEM - Check systemd state on puppetmaster1004 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:22] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:28] PROBLEM - Check systemd state on puppetmaster2004 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:02] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:14] PROBLEM - Check systemd state on puppetmaster2003 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:30] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:38] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_fact_cleanup.service,puppet_report_cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:54] 
PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service,man-db.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:22] all at once because it's midnight UTC [00:12:16] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:13:00] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:13:46] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:14:00] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:14:02] PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 91.198.174.245 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [00:14:12] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [00:14:25] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10AnnWF) @jhathaway Hi there, I just tried with both of my LDAP credential and the gerrit one, none of them working, the LDAP one give me "Authentication attempt has failed, likely due to inval... 
[00:15:26] RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [00:15:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [00:17:08] ^ on it. [00:17:29] denisse: is it netmon related? [00:17:33] thanks [00:17:59] I am looking at the aphlict1001 alert [00:18:00] mutante: no, but it's one of the servers my team maintains and I recently rebooted it. [00:18:10] alright, ty [00:18:26] denisse: "keyholder arm" and the passphrase from pwstore [00:18:30] most likely [00:19:04] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:48] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 106.96 ms [00:20:22] 10SRE, 10LDAP-Access-Requests: Logstash Access for wfan - https://phabricator.wikimedia.org/T325334 (10AnnWF) [00:20:42] 10SRE, 10LDAP-Access-Requests: Logstash Access for Wfan - https://phabricator.wikimedia.org/T325334 (10AnnWF) [00:21:20] !log aphlict1001 - :/var/log/aphlict# truncate aphlict.log --size 100M - T325246 [00:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:24] T325246: aphlict1001 - logrotate or disk space - https://phabricator.wikimedia.org/T325246 [00:21:56] !log aphlict1001 - systemctl start logrotate - T325246 [00:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:38] !log aphlict1001 - systemctl start man-db - T325246 [00:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:04] 10SRE, 10LDAP-Access-Requests: Logstash Access for Wfan - https://phabricator.wikimedia.org/T325334 (10AnnWF) [00:23:26] RECOVERY - Check systemd 
state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:08] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:27:59] 10SRE, 10LDAP-Access-Requests: Logstash Access for Wfan - https://phabricator.wikimedia.org/T325334 (10AnnWF) 05Open→03Resolved [00:31:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:52] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [00:33:32] (03CR) 10Dzahn: "0:03 <+icinga-wm> PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet_f" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [00:40:11] (03PS1) 10Dzahn: puppetmaster: temp absent new puppet clean timers [puppet] - 10https://gerrit.wikimedia.org/r/868481 [00:40:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [00:41:51] (03PS2) 10Dzahn: puppetmaster: temp absent new puppet clean timers [puppet] - 10https://gerrit.wikimedia.org/r/868481 [00:42:15] (03CR) 10Dzahn: "disabled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/868481" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [00:42:35] (03CR) 10Dzahn: [C: 03+2] puppetmaster: temp absent new puppet clean timers [puppet] - 10https://gerrit.wikimedia.org/r/868481 (owner: 10Dzahn) [00:43:00] (03PS3) 10Dzahn: puppetmaster: temp absent new puppet cleanup timers [puppet] - 10https://gerrit.wikimedia.org/r/868481 [00:44:46] (03CR) 10Dzahn: [V: 03+2] puppetmaster: temp absent 
new puppet cleanup timers [puppet] - 10https://gerrit.wikimedia.org/r/868481 (owner: 10Dzahn) [00:45:48] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:06] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:34] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:50] RECOVERY - Check systemd state on puppetmaster1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:52] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:52] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:02] RECOVERY - Host mr1-esams.oob is UP: PING WARNING - Packet loss = 71%, RTA = 107.00 ms [00:50:04] RECOVERY - Check systemd state on puppetmaster2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:34] RECOVERY - Check systemd state on puppetmaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:50] RECOVERY - Check systemd state on puppetmaster2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:53] !log puppetmasters - merged gerrit:868481 to "revert" gerrit:866644,ran puppet and 'systemctl reset-failed' via cumin 
on 10 masters, resolved monitoring alerts [00:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:00] (03CR) 10Dzahn: "00:51 < mutante> !log puppetmasters - merged gerrit:868481 to "revert" gerrit:866644,ran puppet and 'systemctl reset-failed' via cumin on " [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [00:54:20] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:00:40] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:03:06] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [01:05:20] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:05:36] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [01:07:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:08:42] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 90.10 ms [01:31:08] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) >>! 
In T216815#8472047, @Andrew wrote: > Huh, is anyone tasked with this? This is one of the few cases that's keeping Stretch alive in cloud-vps and prod. See #thu... [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:38:48] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Quiddity) @Ladsgroup > Something along the lines: […] Perfect draft. Thank you. :) [[https://meta.wikimedia.org/wiki/Tech/News/2022/51|Added]]. [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:40] jouncebot: nowandnext [02:00:40] No deployments scheduled for the next 5 hour(s) and 59 minute(s) [02:00:40] In 5 hour(s) and 59 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221216T0800) [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:25:56] !issync [02:25:58] Syncing #wikimedia-operations (requested by legoktm) [02:26:00] Set /cs flags #wikimedia-operations TheresNoTime +Aiotv [03:02:26] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:03:56] RECOVERY - High average POST latency for mw 
requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [04:34:47] (03PS1) 10Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) [05:58:58] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:13] (03PS1) 10Marostegui: misc.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868511 (https://phabricator.wikimedia.org/T325154) [06:24:11] (03CR) 10Marostegui: [C: 03+2] misc.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868511 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [06:25:18] (03CR) 10Marostegui: [C: 03+1] mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [06:26:58] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:55] (03PS1) 10Marostegui: wikireplicas.my.cnf: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868512 (https://phabricator.wikimedia.org/T325154) [06:36:28] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [06:36:46] (03CR) 10Marostegui: [C: 03+2] wikireplicas.my.cnf: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868512 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [06:37:56] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [06:41:30] (03PS1) 10Marostegui: core_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868513 (https://phabricator.wikimedia.org/T325154) [06:43:33] (03CR) 10Marostegui: [C: 03+2] core_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868513 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [07:07:37] (03PS1) 10Marostegui: install_server: Install db2185-db2187 [puppet] - 10https://gerrit.wikimedia.org/r/868515 (https://phabricator.wikimedia.org/T325210) [07:09:35] (03CR) 10Marostegui: [C: 03+2] install_server: Install db2185-db2187 [puppet] - 10https://gerrit.wikimedia.org/r/868515 (https://phabricator.wikimedia.org/T325210) (owner: 10Marostegui) [07:17:09] (03PS1) 10Marostegui: install_server: Allow install db1207-db1229 [puppet] - 10https://gerrit.wikimedia.org/r/868516 (https://phabricator.wikimedia.org/T325209) [07:17:46] (03CR) 10Marostegui: [C: 03+2] install_server: Allow install db1207-db1229 [puppet] - 10https://gerrit.wikimedia.org/r/868516 (https://phabricator.wikimedia.org/T325209) (owner: 10Marostegui) [07:35:39] (03PS1) 10Marostegui: db2185-db2187: New hosts [puppet] - 10https://gerrit.wikimedia.org/r/868518 (https://phabricator.wikimedia.org/T325210) [07:38:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:43:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:44:02] (03CR) 10Marostegui: [C: 03+2] db2185-db2187: New hosts [puppet] - 10https://gerrit.wikimedia.org/r/868518 (https://phabricator.wikimedia.org/T325210) (owner: 10Marostegui) [07:55:21] (03CR) 10Ilias Sarantopoulos: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221216T0800) [08:03:08] (03Abandoned) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez) [08:04:40] (03PS1) 10Slyngshede: C:idm::deployment fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/868615 [08:05:40] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment fix logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/868615 (owner: 10Slyngshede) [08:10:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org [08:13:21] 10SRE, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10ayounsi) Not long ago I wrote an internship project proposal for this task, putting it here so it doesn't get forgotten in a Google Doc. 
--- **Problem statement:** Part of clinic duty is to mai... [08:14:16] 10SRE, 10Infrastructure-Foundations, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10MoritzMuehlenhoff) [08:16:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org [08:18:23] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host bast2002.wikimedia.org [08:20:57] 10SRE-OnFire, 10Discovery-Search, 10Observability-Alerting, 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10Gehel) 05Open→03Resolved a:03Gehel We do have an alert for missing masters [08:25:38] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2002.wikimedia.org [08:35:04] !log power down ganeti5003 manually (mgmt/IPMI broken) for pending decom T322048 [08:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:08] T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 [08:37:57] PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100% [08:41:34] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [08:41:36] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [08:42:26] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [08:43:57] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [08:45:01] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti5003.eqsin.wmnet [08:45:10] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [08:45:35] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [08:50:52] (03PS3) 10JMeybohm: Update cert-manager to 1.10.1 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) [08:51:54] (03PS1) 10Muehlenhoff: Remove ganeti5003 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/868617 (https://phabricator.wikimedia.org/T322048) [08:51:59] (03CR) 10JMeybohm: Update cert-manager to 1.10.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [08:53:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:54:28] (03PS1) 10David Caro: novafullstack: don't crash if got error cleaning up some VMs [puppet] - 10https://gerrit.wikimedia.org/r/868618 (https://phabricator.wikimedia.org/T322279) [08:56:46] (03PS2) 10Muehlenhoff: Remove ganeti5003 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/868617 (https://phabricator.wikimedia.org/T322048) [09:01:11] (03CR) 10Elukey: [C: 03+1] Update cert-manager to 1.10.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:01:21] (03PS1) 10David Caro: idp: declare missing enable_webauthn false param in cloud [puppet] - 10https://gerrit.wikimedia.org/r/868619 [09:04:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:05:41] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti5003 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/868617 (https://phabricator.wikimedia.org/T322048) (owner: 10Muehlenhoff) [09:09:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:11:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:12:13] (03PS1) 10Ayounsi: Remove include for 10.132.129.X [dns] - 10https://gerrit.wikimedia.org/r/868620 [09:13:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:13:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:13:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti5003.eqsin.wmnet [09:13:18] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti5003.eqsin.wmnet` - ganeti5003.eqsin.wmnet (**FAIL**) - Downti... 
[09:15:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/868620 (owner: 10Ayounsi) [09:15:19] (03CR) 10Ayounsi: [C: 03+2] Remove include for 10.132.129.X [dns] - 10https://gerrit.wikimedia.org/r/868620 (owner: 10Ayounsi) [09:15:59] (03PS2) 10Muehlenhoff: an-web: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866337 (https://phabricator.wikimedia.org/T135991) [09:16:04] (03PS2) 10Muehlenhoff: piwik: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866329 (https://phabricator.wikimedia.org/T135991) [09:18:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:19:02] (03PS1) 10Slyngshede: C:idm::deployment add RQ and database settings. [puppet] - 10https://gerrit.wikimedia.org/r/868621 [09:19:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:20:03] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment add RQ and database settings. [puppet] - 10https://gerrit.wikimedia.org/r/868621 (owner: 10Slyngshede) [09:21:58] (03PS1) 10JMeybohm: cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) [09:22:56] (03CR) 10CI reject: [V: 04-1] cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:23:01] (03CR) 10Muehlenhoff: C:idm::deployment add RQ and database settings. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868621 (owner: 10Slyngshede) [09:27:20] 10SRE, 10LDAP-Access-Requests: Logstash Access for Wfan - https://phabricator.wikimedia.org/T325334 (10Aklapper) [09:27:44] (03PS1) 10Muehlenhoff: Readd thirdparty/terraform components [puppet] - 10https://gerrit.wikimedia.org/r/868623 (https://phabricator.wikimedia.org/T322344) [09:27:53] 10SRE, 10LDAP-Access-Requests: Logstash Access for Wfan - https://phabricator.wikimedia.org/T325334 (10Aklapper) (Best to add process updates as new comments and not into the task description) [09:30:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] novafullstack: don't crash if got error cleaning up some VMs [puppet] - 10https://gerrit.wikimedia.org/r/868618 (https://phabricator.wikimedia.org/T322279) (owner: 10David Caro) [09:32:09] (03CR) 10Muehlenhoff: [C: 03+2] Readd thirdparty/terraform components [puppet] - 10https://gerrit.wikimedia.org/r/868623 (https://phabricator.wikimedia.org/T322344) (owner: 10Muehlenhoff) [09:38:12] (03PS2) 10David Caro: novafullstack: don't crash if got error cleaning up some VMs [puppet] - 10https://gerrit.wikimedia.org/r/868618 (https://phabricator.wikimedia.org/T322279) [09:39:42] (03CR) 10David Caro: [C: 03+2] novafullstack: don't crash if got error cleaning up some VMs [puppet] - 10https://gerrit.wikimedia.org/r/868618 (https://phabricator.wikimedia.org/T322279) (owner: 10David Caro) [09:44:50] (03CR) 10Ayounsi: P:installserver::proxy: add ability to proxy ssh ports (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [09:44:55] !log import terraform 1.3.6 to thirdparty/terraform for buster/bullseye T322344 [09:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:00] T322344: Move cloud runner CI jobs to trusted runners - https://phabricator.wikimedia.org/T322344 [09:52:33] (03PS1) 10Slyngshede: C:idm::deployment fqdn -> service_fqdn 
[puppet] - 10https://gerrit.wikimedia.org/r/868625 [09:56:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:20] (03CR) 10Jbond: [C: 03+1] "looks good as is but possible suggestion inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 (owner: 10Volans) [10:01:22] (03CR) 10Jbond: First stab at possible ferm::qos resource for DSCP marking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:01:56] (03CR) 10Gehel: [C: 03+1] Set role_contacts for apifeatureusage::logstash [puppet] - 10https://gerrit.wikimedia.org/r/863329 (owner: 10Muehlenhoff) [10:01:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:02:17] (03CR) 10Jbond: [C: 03+1] P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 (owner: 10Ssingh) [10:06:45] (03CR) 10Volans: [C: 03+2] base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:08:34] (03PS9) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [10:09:02] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment fqdn -> service_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/868625 (owner: 10Slyngshede) [10:11:13] (03CR) 10Volans: [C: 03+2] base::cloud_production: introduce new profile [puppet] - 
10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:11:26] (03CR) 10Muehlenhoff: [C: 03+2] Set role_contacts for apifeatureusage::logstash [puppet] - 10https://gerrit.wikimedia.org/r/863329 (owner: 10Muehlenhoff) [10:11:48] volans: I'll merge your patch along? [10:11:51] moritzm: if you have my patch [10:11:53] go ahead [10:11:54] :D [10:12:03] I tried but you got the lock first [10:12:21] he ran out of 0.8 FTE time before [10:12:26] merged :-) [10:12:29] lol [10:14:06] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:37] (03CR) 10Ayounsi: Example strategy for marking DSCP with ferm and puppet integration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:20:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:25:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) [10:31:46] (03CR) 10Elukey: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [10:35:44] (03PS1) 10David Caro: metricsinfra: use epp templates [puppet] - 10https://gerrit.wikimedia.org/r/868631 [10:38:35] 
(03PS1) 10Volans: cloudcumin: use the puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) [10:43:22] (03PS2) 10JMeybohm: cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) [10:44:08] (03CR) 10CI reject: [V: 04-1] cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:45:53] (03PS27) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [10:46:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: introduce role skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:48:49] (03PS1) 10David Caro: alertmanager: format a bit nicer the default args [puppet] - 10https://gerrit.wikimedia.org/r/868634 [10:49:27] (03CR) 10Ayounsi: First stab at possible ferm::qos resource for DSCP marking (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [10:51:03] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2001-dev.codfw.wmnet with OS bullseye [10:52:39] (03CR) 10Jbond: [C: 03+1] "lgtm, question inline" [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:55:21] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:59:32] (03PS1) 10Volans: cloudcumin: actually allow ssh from the masters [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) [11:05:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for SDelbecque - 
https://phabricator.wikimedia.org/T324753 (10SDelbecque-WMF) thanks! [11:10:22] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:51] (03PS1) 10David Caro: karma: add metrcsinfra alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) [11:16:01] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [11:17:26] (03CR) 10David Caro: "Tested this locally with a docker image of karma (v0.99, like prod)." [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [11:19:07] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [11:19:28] (03CR) 10Jbond: [C: 03+1] cloudcumin: use the puppetdb microservice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:20:19] (03PS1) 10Muehlenhoff: Remove email addresses for absented users [puppet] - 10https://gerrit.wikimedia.org/r/868639 [11:20:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:22:21] (03PS2) 10Volans: cloudcumin: actually allow ssh from the masters [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) [11:22:24] (03PS2) 10Volans: cloudcumin: use the puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) [11:22:36] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:22:55] (03PS3) 10Volans: cloudcumin: actually allow ssh from the masters 
[puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) [11:23:11] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:26:14] (03CR) 10Volans: [C: 03+2] cloudcumin: use the puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/868632 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:31:36] (03PS4) 10Volans: cloudcumin: actually allow ssh from the masters [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) [11:31:44] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:36:33] (03CR) 10Volans: [C: 03+2] cloudcumin: actually allow ssh from the masters [puppet] - 10https://gerrit.wikimedia.org/r/868636 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:37:06] (03CR) 10Muehlenhoff: [C: 03+2] Remove email addresses for absented users [puppet] - 10https://gerrit.wikimedia.org/r/868639 (owner: 10Muehlenhoff) [11:38:54] (03PS1) 10Marostegui: production-backup1*: Add orchestrator grants [puppet] - 10https://gerrit.wikimedia.org/r/868641 [11:40:51] (03CR) 10Jbond: [C: 03+1] idp: declare missing enable_webauthn false param in cloud [puppet] - 10https://gerrit.wikimedia.org/r/868619 (owner: 10David Caro) [11:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:47:26] (03PS2) 10Muehlenhoff: openstack/codfw1dev: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860903 (https://phabricator.wikimedia.org/T308013) [11:48:54] (03PS1) 10Volans: cr-labs: allow SSH from the cloudcumin_group [homer/public] 
- 10https://gerrit.wikimedia.org/r/868646 (https://phabricator.wikimedia.org/T319401) [11:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:51:26] (03CR) 10Muehlenhoff: [C: 03+1] "Sorry, I forgot about the cloud setup!" [puppet] - 10https://gerrit.wikimedia.org/r/868619 (owner: 10David Caro) [11:53:01] (03CR) 10Muehlenhoff: [C: 03+2] openstack/codfw1dev: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860903 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:54:07] (03PS2) 10Muehlenhoff: lvs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863298 (https://phabricator.wikimedia.org/T308013) [11:54:39] (03PS2) 10Muehlenhoff: cache::kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863301 (https://phabricator.wikimedia.org/T308013) [11:57:18] (03PS2) 10Volans: cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 [11:57:53] (03CR) 10Jcrespo: production-backup1*: Add orchestrator grants (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868641 (owner: 10Marostegui) [11:58:35] (03CR) 10Ayounsi: [C: 03+1] cr-labs: allow SSH from the cloudcumin_group [homer/public] - 10https://gerrit.wikimedia.org/r/868646 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:58:42] (03PS2) 10Marostegui: production-backup1*: Add orchestrator grants [puppet] - 10https://gerrit.wikimedia.org/r/868641 [11:58:55] (03CR) 10Marostegui: production-backup1*: Add orchestrator grants (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868641 (owner: 10Marostegui) [11:59:27] (03CR) 10Jcrespo: [C: 03+1] production-backup1*: Add orchestrator grants [puppet] - 10https://gerrit.wikimedia.org/r/868641 (owner: 10Marostegui) 
[12:00:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 28398 [12:00:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28398 [12:00:59] (03CR) 10Marostegui: [C: 03+2] production-backup1*: Add orchestrator grants [puppet] - 10https://gerrit.wikimedia.org/r/868641 (owner: 10Marostegui) [12:01:03] (03CR) 10Volans: [C: 03+2] cr-labs: allow SSH from the cloudcumin_group [homer/public] - 10https://gerrit.wikimedia.org/r/868646 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:01:24] (03CR) 10Vgutierrez: [C: 03+1] lvs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863298 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:01:38] (03Merged) 10jenkins-bot: cr-labs: allow SSH from the cloudcumin_group [homer/public] - 10https://gerrit.wikimedia.org/r/868646 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:01:57] (03CR) 10Vgutierrez: [C: 03+1] cache::kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863301 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:07:07] (03CR) 10Muehlenhoff: [C: 03+2] lvs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863298 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:09:09] (03CR) 10Muehlenhoff: [C: 03+2] cache::kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863301 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:09:12] !log run homer on cr[1-2]-{eqiad,codfw} to allow SSH from cloudcumin hosts to cloud hosts [12:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:06] (03PS2) 10Muehlenhoff: Add SPDX headers to various base/IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) [12:13:56] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:22] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: haproxy: service: fix typo in srange specification [puppet] - 10https://gerrit.wikimedia.org/r/868651 (https://phabricator.wikimedia.org/T324992) [12:24:36] (03PS1) 10Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 [12:24:38] (03PS1) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [12:26:20] (03PS1) 10Volans: cumin::cloud_target: read also the cloud_cumin key [puppet] - 10https://gerrit.wikimedia.org/r/868655 (https://phabricator.wikimedia.org/T323483) [12:26:37] (03PS7) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:27:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38843/console" [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [12:27:58] (03CR) 10CI reject: [V: 04-1] wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 (owner: 10Jbond) [12:28:28] (03CR) 10Volans: "addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 (owner: 10Volans) [12:29:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:installserver::proxy: add ability to proxy ssh ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [12:32:36] !log deploy update to webproxy https://gerrit.wikimedia.org/r/c/operations/puppet/+/868372 [12:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:19] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: haproxy: specify acme-chief 
certificate [puppet] - 10https://gerrit.wikimedia.org/r/868656 [12:37:25] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/868656/38844/" [puppet] - 10https://gerrit.wikimedia.org/r/868656 (owner: 10Arturo Borrero Gonzalez) [12:38:05] (03CR) 10Majavah: [C: 03+1] cloudlb: haproxy: specify acme-chief certificate [puppet] - 10https://gerrit.wikimedia.org/r/868656 (owner: 10Arturo Borrero Gonzalez) [12:38:07] (03CR) 10Cathal Mooney: "Thanks Arzhel for the feedback, some comments inline. Mostly that the examples here are just to illustrate how the definitions could be u" [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [12:39:51] (03CR) 10Majavah: [C: 03+1] cloudlb: haproxy: service: fix typo in srange specification [puppet] - 10https://gerrit.wikimedia.org/r/868651 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:40:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: service: fix typo in srange specification [puppet] - 10https://gerrit.wikimedia.org/r/868651 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:40:26] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudlb: haproxy: specify acme-chief certificate [puppet] - 10https://gerrit.wikimedia.org/r/868656 (owner: 10Arturo Borrero Gonzalez) [12:47:23] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: allow cloudgw2001-dev to use openstack-codfw1dev certificate [puppet] - 10https://gerrit.wikimedia.org/r/868661 (https://phabricator.wikimedia.org/T324992) [12:48:57] (03CR) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [12:49:32] PROBLEM - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
squid-logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:03] (03CR) 10Majavah: [C: 03+1] acme_chief: allow cloudgw2001-dev to use openstack-codfw1dev certificate [puppet] - 10https://gerrit.wikimedia.org/r/868661 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:52:13] (03PS1) 10Jbond: installserver::proxy: add ipv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/868662 [12:52:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] installserver::proxy: add ipv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/868662 (owner: 10Jbond) [12:53:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: allow cloudgw2001-dev to use openstack-codfw1dev certificate [puppet] - 10https://gerrit.wikimedia.org/r/868661 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:54:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/867568 (owner: 10Slyngshede) [12:59:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:01:04] (03PS1) 10Muehlenhoff: vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991) [13:03:01] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:04:42] (03PS1) 10Alexandros Kosiaris: admin: Create hxi-ctr account [puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) [13:05:19] (03CR) 10CI reject: [V: 04-1] admin: Create hxi-ctr account [puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [13:10:20] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:01] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:19:36] PROBLEM - haproxy alive on cloudgw2001-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [13:19:36] PROBLEM - haproxy process on cloudgw2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:19:56] PROBLEM - Check systemd state on cloudgw2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:03] (03PS1) 10Majavah: cr-labs: allow acme-chief requests [homer/public] - 10https://gerrit.wikimedia.org/r/868665 (https://phabricator.wikimedia.org/T324992) [13:25:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: allow acme-chief requests [homer/public] - 10https://gerrit.wikimedia.org/r/868665 (https://phabricator.wikimedia.org/T324992) (owner: 10Majavah) [13:25:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868655 (https://phabricator.wikimedia.org/T323483) (owner: 10Volans) [13:28:38] (03PS2) 10Alexandros Kosiaris: admin: Create hxi-ctr account 
[puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) [13:29:15] (03CR) 10CI reject: [V: 04-1] admin: Create hxi-ctr account [puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [13:30:55] (03PS3) 10Alexandros Kosiaris: admin: Create hxi-ctr account [puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) [13:31:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/868665 (https://phabricator.wikimedia.org/T324992) (owner: 10Majavah) [13:31:52] (03CR) 10Cathal Mooney: [C: 03+2] cr-labs: allow acme-chief requests [homer/public] - 10https://gerrit.wikimedia.org/r/868665 (https://phabricator.wikimedia.org/T324992) (owner: 10Majavah) [13:32:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Create hxi-ctr account [puppet] - 10https://gerrit.wikimedia.org/r/868664 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [13:32:38] (03Merged) 10jenkins-bot: cr-labs: allow acme-chief requests [homer/public] - 10https://gerrit.wikimedia.org/r/868665 (https://phabricator.wikimedia.org/T324992) (owner: 10Majavah) [13:34:51] (03PS2) 10Alexandros Kosiaris: admin: Add mnz to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/868098 (https://phabricator.wikimedia.org/T325072) [13:35:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) 05Open→03Resolved a:03akosiaris Hi @HXi-WMF, your account has been created and access to the relevant groups... 
[13:36:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add mnz to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/868098 (https://phabricator.wikimedia.org/T325072) (owner: 10Alexandros Kosiaris) [13:36:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10akosiaris) 05Open→03Resolved a:03akosiaris Hi @MunizaA, access to the analytics-admins group has been granted. Please wait 30m for the access to propagate across... [13:41:53] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/868667 [13:42:06] RECOVERY - Check systemd state on cloudgw2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:07] (03CR) 10David Caro: [C: 03+2] "No problem :)" [puppet] - 10https://gerrit.wikimedia.org/r/868619 (owner: 10David Caro) [13:43:22] RECOVERY - haproxy process on cloudgw2001-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:45:54] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/868667 (owner: 10Muehlenhoff) [13:46:52] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868655 (https://phabricator.wikimedia.org/T323483) (owner: 10Volans) [13:47:29] (03CR) 10Volans: [C: 03+2] cumin::cloud_target: read also the cloud_cumin key [puppet] - 10https://gerrit.wikimedia.org/r/868655 (https://phabricator.wikimedia.org/T323483) (owner: 10Volans) [13:47:54] moritzm: can I merge your patch too? 
[13:48:14] volans: please do [13:48:18] RECOVERY - haproxy alive on cloudgw2001-dev is OK: OK check_alive uptime 375s https://wikitech.wikimedia.org/wiki/HAProxy [13:48:43] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2001-dev.codfw.wmnet with OS bullseye [13:48:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:09] (03PS1) 10Muehlenhoff: Add role_contacts for role::mariadb::misc::analytics::backup [puppet] - 10https://gerrit.wikimedia.org/r/868670 [13:51:28] (03CR) 10David Caro: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [13:52:34] (03PS1) 10Volans: cloudcumin: use the webproxy to connect to Cloud [puppet] - 10https://gerrit.wikimedia.org/r/868673 (https://phabricator.wikimedia.org/T319401) [13:54:57] (03PS2) 10Volans: cloudcumin: use the webproxy to connect to Cloud [puppet] - 10https://gerrit.wikimedia.org/r/868673 (https://phabricator.wikimedia.org/T319401) [13:55:05] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868673 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [13:56:28] (03PS17) 10Jbond: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [13:56:30] (03PS1) 10Jbond: ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 [13:58:14] (03CR) 10CI reject: [V: 04-1] ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 (owner: 10Jbond) [13:58:40] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for 
DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [14:01:01] (03PS18) 10Jbond: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [14:01:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:29] (03PS2) 10Jbond: ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 [14:02:52] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [14:03:49] (03CR) 10CI reject: [V: 04-1] ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 (owner: 10Jbond) [14:10:05] (03PS12) 10Volans: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [14:10:17] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [14:15:48] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/868677 [14:19:47] (03CR) 10Volans: [C: 03+2] "Apparently this doesn't work as the key is not read..." 
[puppet] - 10https://gerrit.wikimedia.org/r/868655 (https://phabricator.wikimedia.org/T323483) (owner: 10Volans) [14:21:30] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/868677 (owner: 10Muehlenhoff) [14:23:00] (03PS1) 10JMeybohm: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) [14:24:35] (03CR) 10David Caro: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:26:41] (03CR) 10JMeybohm: "This is expected to fail CI because k8s 1.16 does not support seccompProfile to be set. The follow up CR removes the seccompProfile until " [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [14:27:02] (03PS2) 10JMeybohm: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) [14:28:45] (03PS1) 10David Caro: idp: add missing profile::idp::webauthn_relaying_party to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/868684 [14:29:51] (03CR) 10David Caro: "Missed this one too" [puppet] - 10https://gerrit.wikimedia.org/r/868684 (owner: 10David Caro) [14:29:59] (03CR) 10David Caro: [C: 03+2] idp: add missing profile::idp::webauthn_relaying_party to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/868684 (owner: 10David Caro) [14:32:33] 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff) [14:34:22] (03PS13) 10Volans: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [14:34:53] (03CR) 10Volans: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [14:37:50] (03PS1) 10Jcrespo: mariadb: Reenable notifications on backup1 mariadb instances [puppet] - 10https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582) [14:38:29] (03CR) 10Jcrespo: [C: 04-1] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [14:38:43] (03PS3) 10Jbond: ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 [14:40:49] (03CR) 10CI reject: [V: 04-1] ferm: example using new ferm::qos resource [puppet] - 10https://gerrit.wikimedia.org/r/868674 (owner: 10Jbond) [14:44:13] (03CR) 10AikoChou: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [14:52:49] (03CR) 10Elukey: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [14:54:05] (03PS1) 10Muehlenhoff: Etherpad: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868691 (https://phabricator.wikimedia.org/T135991) [14:59:21] (03PS1) 10Muehlenhoff: Remove misc-ops alias [puppet] - 10https://gerrit.wikimedia.org/r/868694 [15:01:07] (03PS1) 10Andrew Bogott: rabbitmq drain_queue.py: Don't error out non-oslo messages [puppet] - 10https://gerrit.wikimedia.org/r/868695 (https://phabricator.wikimedia.org/T325363) [15:03:04] (03CR) 10AikoChou: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [15:06:36] (03PS1) 10Muehlenhoff: Puppetboard: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868696 
(https://phabricator.wikimedia.org/T135991) [15:09:11] (03CR) 10Reedy: Fix PHP string interpolation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [15:09:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868673 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:10:06] (03CR) 10David Caro: rabbitmq drain_queue.py: Don't error out non-oslo messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868695 (https://phabricator.wikimedia.org/T325363) (owner: 10Andrew Bogott) [15:12:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868694 (owner: 10Muehlenhoff) [15:13:47] (03CR) 10Ilias Sarantopoulos: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [15:14:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868696 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:15:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:22] (03PS1) 10AikoChou: ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868702 (https://phabricator.wikimedia.org/T325199) [15:25:21] (03PS3) 10Reedy: Fix PHP string interpolation [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) [15:27:17] (03PS2) 10AikoChou: ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868702 (https://phabricator.wikimedia.org/T325199) [15:27:33] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10jhathaway) @AnnWF glad you are 
in! [15:29:22] (03CR) 10Muehlenhoff: [C: 03+2] Remove misc-ops alias [puppet] - 10https://gerrit.wikimedia.org/r/868694 (owner: 10Muehlenhoff) [15:29:35] (03PS1) 10Muehlenhoff: puppet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) [15:29:37] (03PS1) 10Muehlenhoff: analytics::cluster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868704 (https://phabricator.wikimedia.org/T308013) [15:29:39] (03PS1) 10Muehlenhoff: openstack::nova: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868705 (https://phabricator.wikimedia.org/T308013) [15:29:41] (03PS1) 10Muehlenhoff: redis: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868706 (https://phabricator.wikimedia.org/T308013) [15:29:43] (03PS1) 10Muehlenhoff: lists: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) [15:29:45] (03PS1) 10Muehlenhoff: orchestrator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013) [15:29:47] (03PS1) 10Muehlenhoff: cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) [15:29:49] (03PS1) 10Muehlenhoff: acmechief/ncredir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013) [15:29:51] (03PS1) 10Muehlenhoff: vrts / doc / etherpad / planet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) [15:30:18] (03CR) 10Muehlenhoff: [C: 03+2] Puppetboard: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868696 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:33:25] (03PS4) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) 
[15:34:40] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@44d4e81]: Fix subtle bug on image_suggestions when resolving varprop. [15:34:49] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@44d4e81]: Fix subtle bug on image_suggestions when resolving varprop. (duration: 00m 09s) [15:36:23] (03CR) 10Krinkle: Fix PHP string interpolation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [15:37:01] (03CR) 10AikoChou: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [15:37:48] (03CR) 10Volans: [C: 03+2] cloudcumin: use the webproxy to connect to Cloud [puppet] - 10https://gerrit.wikimedia.org/r/868673 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:38:04] (03PS2) 10Muehlenhoff: cache: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868709 (https://phabricator.wikimedia.org/T308013) [15:38:40] (03CR) 10Elukey: ml-services: update revertrisk docker images (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [15:38:56] (03Abandoned) 10Volans: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:46:12] (03CR) 10Marostegui: [C: 03+1] orchestrator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:47:12] (03CR) 10Elukey: "Left some irrelevant comments but it looks good to me. 
There is 0 chance that I can spot anomalies (unless they are really big and promine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [15:50:44] (03CR) 10AikoChou: ml-services: update revertrisk docker images (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [15:51:12] (03CR) 10JMeybohm: cert-manager: Update to 1.10.1 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [15:51:44] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [15:51:46] 10SRE, 10SRE-swift-storage: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10MatthewVernon) [15:51:51] 10SRE, 10observability, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348 (10MatthewVernon) [15:51:53] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [15:51:55] 10SRE, 10SRE-swift-storage: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10MatthewVernon) [15:52:34] (03PS5) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) [15:52:45] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [15:52:49] 10SRE, 10SRE-swift-storage: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10MatthewVernon) [15:52:51] 10SRE, 
10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10mpopov) Thank you @akosiaris! [15:54:20] (03CR) 10Ahmon Dancy: "A corresponding change was made yesterday to train-dev: https://gitlab.wikimedia.org/repos/releng/train-dev/-/commit/c0572d96faad51dc5cce2" [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [15:54:22] (03PS1) 10Jbond: base::cloud::production: allow cloud prod to override ssh [puppet] - 10https://gerrit.wikimedia.org/r/868716 [15:55:11] (03CR) 10Elukey: [C: 03+2] ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [15:55:16] (03PS3) 10Elukey: ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [15:55:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38847/console" [puppet] - 10https://gerrit.wikimedia.org/r/868716 (owner: 10Jbond) [15:57:28] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:58:20] (03PS2) 10Jbond: base::cloud::production: allow cloud prod to override ssh [puppet] - 10https://gerrit.wikimedia.org/r/868716 [15:58:49] (03PS5) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 
(https://phabricator.wikimedia.org/T323515) [15:59:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38848/console" [puppet] - 10https://gerrit.wikimedia.org/r/868716 (owner: 10Jbond) [16:00:15] (03PS3) 10AikoChou: ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868702 (https://phabricator.wikimedia.org/T325199) [16:01:21] (03CR) 10AikoChou: ml-services: update revertrisk docker images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [16:01:56] (03CR) 10Krinkle: [C: 03+1] Fix PHP string interpolation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [16:03:59] (03PS3) 10Jbond: base::cloud::production: allow cloud prod to override ssh [puppet] - 10https://gerrit.wikimedia.org/r/868716 [16:09:07] (03PS1) 10MVernon: hiera: move swift accounts_keys into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) [16:09:36] (03CR) 10AOkoth: vrts: add vrts2001 values and add database port in config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:10:05] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:09] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:11:32] (03PS2) 10MVernon: hiera: move swift accounts_keys into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) [16:12:22] (03CR) 10Elukey: [C: 03+2] ml-services: update revertrisk docker images (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) (owner: 10AikoChou) [16:17:13] (03PS1) 10Sbailey: enable Linter extension maintTagTemplate.php in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868719 (https://phabricator.wikimedia.org/T175177) [16:18:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:23] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4007 [16:19:38] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4007 [16:20:15] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:20:59] (03PS1) 10Volans: cloudcumin: improve ssh config [puppet] - 10https://gerrit.wikimedia.org/r/868720 (https://phabricator.wikimedia.org/T319401) [16:21:18] (03PS1) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) [16:22:57] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: gaenti4007 - robh@cumin2002" [16:23:59] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: gaenti4007 - robh@cumin2002" [16:23:59] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:48] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4007.mgmt.ulsfo.wmnet with reboot policy FORCED [16:25:30] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10MatthewVernon) I've put out two CRs; an equivalent change will also need doing to private-puppet. They'll all need co-ordinating. [16:28:45] (03CR) 10Dzahn: "looks good to me! can you compile it, please and link the result?" [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:34:59] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868720 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [16:35:05] (03CR) 10Jbond: [C: 03+1] cloudcumin: improve ssh config [puppet] - 10https://gerrit.wikimedia.org/r/868720 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [16:36:24] (03CR) 10Volans: [C: 03+2] cloudcumin: improve ssh config [puppet] - 10https://gerrit.wikimedia.org/r/868720 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [16:36:43] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Refactor swift credentials to be global rather than per-site - https://phabricator.wikimedia.org/T162123 (10MatthewVernon) [also review by a puppet expert :) ] [16:39:53] (03CR) 10Vgutierrez: [C: 03+1] acmechief/ncredir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:45:49] (03PS2) 10Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 [16:45:51] (03PS2) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [16:46:54] !log 
robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4007.mgmt.ulsfo.wmnet with reboot policy FORCED [16:49:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:26] (03PS1) 10Jbond: rake - spdx: also check hiera files [puppet] - 10https://gerrit.wikimedia.org/r/868723 [17:01:49] (03CR) 10David Caro: rabbitmq drain_queue.py: Don't error out non-oslo messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868695 (https://phabricator.wikimedia.org/T325363) (owner: 10Andrew Bogott) [17:03:40] (03CR) 10Jbond: [C: 03+2] rake - spdx: also check hiera files [puppet] - 10https://gerrit.wikimedia.org/r/868723 (owner: 10Jbond) [17:11:15] (03PS2) 10Andrew Bogott: rabbitmq drain_queue.py: Don't error out non-oslo messages [puppet] - 10https://gerrit.wikimedia.org/r/868695 (https://phabricator.wikimedia.org/T325363) [17:12:17] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq drain_queue.py: Don't error out non-oslo messages [puppet] - 10https://gerrit.wikimedia.org/r/868695 (https://phabricator.wikimedia.org/T325363) (owner: 10Andrew Bogott) [17:14:48] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) [17:15:08] (03PS2) 10David Caro: metricsinfra: use epp templates [puppet] - 10https://gerrit.wikimedia.org/r/868631 [17:15:10] (03PS1) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [17:15:39] RECOVERY - Check systemd state on install1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:06] (03PS2) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 
10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [17:16:45] (03PS3) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [17:18:37] (03CR) 10CI reject: [V: 04-1] metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [17:22:14] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) [17:25:49] (03CR) 10Dzahn: [C: 03+2] Etherpad: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868691 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:28:04] !log etherpad1003 - testing new: sudo systemctl start wmf_auto_restart_envoyproxy [17:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:38] (03CR) 10Dzahn: [C: 03+2] "manually started service once. 
no issues" [puppet] - 10https://gerrit.wikimedia.org/r/868691 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:30:27] (03PS1) 10Volans: cloud bastions: allow SSH from cloudcumin* hosts [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) [17:30:57] (03CR) 10Dzahn: [C: 03+2] vrts / doc / etherpad / planet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [17:31:22] (03CR) 10Dzahn: [C: 03+1] vrts / doc / etherpad / planet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [17:32:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [17:38:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38851/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:39:34] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [17:50:10] (03PS4) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [17:51:05] (03PS5) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [17:52:48] (03CR) 10CI reject: [V: 04-1] metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [17:52:56] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4007'] [17:53:02] 
(03CR) 10Brion VIBBER: [C: 03+1] "Heh I don't have +2 either :D but it looks right!" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik) [17:53:03] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [17:53:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4007'] [17:56:12] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4007'] [17:59:15] (03CR) 10David Caro: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [18:08:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4007'] [18:09:09] (03PS2) 10Volans: cloud bastions: allow SSH from cloudcumin* hosts [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) [18:09:18] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868729 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [18:10:05] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) [18:10:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [18:11:09] (03CR) 10David Caro: metricsinfra: add optional basic auth to project_proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [18:11:22] (03CR) 10Volans: [C: 03+2] cloud bastions: allow SSH from cloudcumin* hosts [puppet] - 10https://gerrit.wikimedia.org/r/868729 
(https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [18:11:32] (03CR) 10Dzahn: "and if you could add a sentence or 2 to the commit message why we are doing this. that we need the new parameter because we use the slave," [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [18:13:11] (03PS6) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [18:14:13] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 (owner: 10Muehlenhoff) [18:14:21] (03PS3) 10Andrew Bogott: labs_bootstrapvz: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839449 (owner: 10Muehlenhoff) [18:14:34] (03PS1) 10FNegri: Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [18:14:55] (03CR) 10CI reject: [V: 04-1] Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [18:16:52] (03PS6) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) [18:17:44] (03PS1) 10RobH: ganeti4007 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/868733 (https://phabricator.wikimedia.org/T317247) [18:18:02] (03CR) 10RobH: [C: 03+2] ganeti4007 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/868733 (https://phabricator.wikimedia.org/T317247) (owner: 10RobH) [18:18:45] (03CR) 10CI reject: [V: 04-1] vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [18:19:40] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4007.ulsfo.wmnet 
with OS bullseye [18:19:53] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bullseye [18:22:52] (03PS1) 10Andrew Bogott: OpenStack: remove Wallaby config [puppet] - 10https://gerrit.wikimedia.org/r/868734 [18:27:47] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [18:27:51] (03CR) 10Andrew Bogott: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/868734/38853/" [puppet] - 10https://gerrit.wikimedia.org/r/868734 (owner: 10Andrew Bogott) [18:28:17] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [18:30:14] (03PS2) 10Andrew Bogott: openstack::nova: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868705 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:32:15] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868705 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:32:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:40:42] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is 
CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Due for decom - T318659 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:41:23] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage [18:44:26] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage [18:44:49] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:07] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:59:52] (03PS14) 10Volans: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:00:01] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" [19:01:44] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:01:46] (03CR) 10CI reject: [V: 04-1] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:04:20] (03PS15) 10Volans: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:06:11] (03CR) 10CI reject: [V: 04-1] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 
(https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:12:34] (03PS1) 10Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) [19:14:55] (03CR) 10Dzahn: "Tyler, just asking about which admin group makes sense for jenkins deploy, not the rest of the keyholder stuff" [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:15:33] (03CR) 10Dzahn: [V: 04-1] "some name mismatch. secret(): invalid secret keyholder/deploy_jenkins_ci" [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:20:29] (03CR) 10Dzahn: [V: 04-1] "ah, Jaime, so do we need 1 identity or actually 2 identities? there is jenkins-ci and jenkins-releases but should we use the same or diffe" [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:22:19] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) So.. question here.. We have 2 different "jenkins", jenkins on contint* and jenkins on releases*.... 
[19:22:40] (03PS16) 10Volans: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:24:47] (03CR) 10CI reject: [V: 04-1] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:26:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:26:33] (03PS2) 10Volans: Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [19:27:33] (03PS2) 10Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) [19:29:00] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [19:31:57] (03PS17) 10Volans: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:39:09] (03CR) 10CI reject: [V: 04-1] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:42:53] (03CR) 10Volans: "Ready for review, I guess that PCC is not able to help us much here because file_line is executed by the puppet client at runtime." 
[puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [19:44:07] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:49:13] (03PS1) 10Ahmon Dancy: profile::gitlab::runner::allowed_services: Add kubestagemaster [puppet] - 10https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) [19:50:48] (03PS1) 10Dzahn: keyholder: add fake deployment keys for jenkins deploy [labs/private] - 10https://gerrit.wikimedia.org/r/868738 (https://phabricator.wikimedia.org/T324014) [19:51:28] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:52:06] (03PS18) 10Jbond: spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [19:53:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38856/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [19:55:19] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin2002" [19:55:20] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4007.ulsfo.wmnet with OS bullseye [19:55:26] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bullseye completed: - ganeti4007 (**PASS**) - Rem... 
[19:57:54] (CR) Dzahn: [V: +2 C: +2] keyholder: add fake deployment keys for jenkins deploy [labs/private] - https://gerrit.wikimedia.org/r/868738 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[19:58:39] (CR) Dzahn: "we needed https://gerrit.wikimedia.org/r/c/labs/private/+/868738 or compiling won't work" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[19:59:45] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 124 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:01:19] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:04:29] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) Open→Resolved a:RobH @MoritzMuehlenhoff, ganeti4007 is all yours and this resolves all pending misc ulsfo installs =]
[20:06:42] (PS1) Jbond: kafka_config: set a real string for default api_version [puppet] - https://gerrit.wikimedia.org/r/868739
[20:09:39] (CR) Jbond: "@otto, luca, would be great to get you input in this, thanks" [puppet] - https://gerrit.wikimedia.org/r/868739 (owner: Jbond)
[20:11:46] (PS19) Jbond: spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[20:12:23] (PS2) Jbond: kafka_config: set a real string for default api_version [puppet] - https://gerrit.wikimedia.org/r/868739
[20:17:08] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/output/868736/38857/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[20:32:55] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) If you are ok with "deployment-ci-admins" to be used as the admin group that can deploy jenkins.....
[20:35:35] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) @thcipriani This is a bit like an access request, what do you think about the above?
[20:38:39] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:01] (CR) Subramanya Sastry: [C: +2] enable Linter extension maintTagTemplate.php in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/868719 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[20:39:51] (Merged) jenkins-bot: enable Linter extension maintTagTemplate.php in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/868719 (https://phabricator.wikimedia.org/T175177) (owner: Sbailey)
[20:41:27] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:44:47] SRE, Infrastructure-Foundations, Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394 (jhathaway)
[20:45:41] SRE, Infrastructure-Foundations, Mail: Puppetry - https://phabricator.wikimedia.org/T325395 (jhathaway)
[20:46:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:46:42] SRE, Infrastructure-Foundations, Mail: Postfix Module - https://phabricator.wikimedia.org/T325396 (jhathaway)
[20:47:45] SRE, Infrastructure-Foundations, Mail: Rspamd module - https://phabricator.wikimedia.org/T325397 (jhathaway)
[20:48:28] SRE, Infrastructure-Foundations, Mail: Postfix MTA Profile - https://phabricator.wikimedia.org/T325398 (jhathaway)
[20:49:41] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:38] (PS7) Dzahn: vrts: add vrts2001 values and add database port in config [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[20:59:07] (CR) Dzahn: [V: +1 C: +1] "lgtm https://puppet-compiler.wmflabs.org/output/868467/38859/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[21:00:42] (CR) Dzahn: [V: +1 C: +1] "careful when deploying, as we spoke on IRC, but I think it's fine and an easy revert too. don't do after code freeze though." [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[21:03:08] SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (BCornwall)
[21:05:34] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-infra - https://phabricator.wikimedia.org/T325401 (jhathaway)
[21:07:55] SRE, Infrastructure-Foundations, Mail: mta-outbound-infra - https://phabricator.wikimedia.org/T325402 (jhathaway)
[21:08:15] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-infra - https://phabricator.wikimedia.org/T325402 (jhathaway)
[21:09:06] SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (BCornwall) I've updated the list based on a recent week-long tcpdump on both DNS servers. @Papaul, could you update scs-a1-codfw.mgmt.codfw.wmnet to point to the newer `10.3.0.1` D...
[21:10:51] SRE, Infrastructure-Foundations, Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403 (jhathaway)
[21:11:29] SRE, Infrastructure-Foundations, Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403 (jhathaway)
[21:11:31] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-infra - https://phabricator.wikimedia.org/T325401 (jhathaway)
[21:11:35] SRE, Infrastructure-Foundations, Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394 (jhathaway)
[21:11:41] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-infra - https://phabricator.wikimedia.org/T325402 (jhathaway)
[21:11:47] SRE, Infrastructure-Foundations, Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394 (jhathaway)
[21:11:53] SRE, Infrastructure-Foundations, Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403 (jhathaway)
[21:13:24] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-lists - https://phabricator.wikimedia.org/T325404 (jhathaway)
[21:14:15] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-lists - https://phabricator.wikimedia.org/T325405 (jhathaway)
[21:15:02] SRE, Infrastructure-Foundations, Mail: mta-inbound-wiki - https://phabricator.wikimedia.org/T325406 (jhathaway)
[21:15:37] SRE, Infrastructure-Foundations, Mail: mta-outbound-wiki - https://phabricator.wikimedia.org/T325407 (jhathaway)
[21:16:06] SRE, Infrastructure-Foundations, Mail: Provision mta-outbound-wiki - https://phabricator.wikimedia.org/T325407 (jhathaway)
[21:16:13] SRE, Infrastructure-Foundations, Mail: Provision mta-inbound-wiki - https://phabricator.wikimedia.org/T325406 (jhathaway)
[21:16:40] SRE, Infrastructure-Foundations, Mail: Replace Null client configs - https://phabricator.wikimedia.org/T325408 (jhathaway)
[21:17:13] SRE, Infrastructure-Foundations, Mail: Remove Exim based MTAs - https://phabricator.wikimedia.org/T325409 (jhathaway)
[21:27:53] (PS1) JHathaway: Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396)
[21:28:51] (CR) JHathaway: "kindly review!" [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[21:30:28] SRE, Infrastructure-Foundations, Mail, Patch-For-Review: Postfix Module - https://phabricator.wikimedia.org/T325396 (jhathaway) @jbond, happy to discuss any concerns you have with vendoring this module, either on this ticket or on gerrit.
[21:40:17] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:49] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:51] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (thcipriani) >>! In T324014#8474913, @Dzahn wrote: > @thcipriani This is a bit like an access request, wh...
[22:03:47] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[22:04:47] (PS8) AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[22:06:47] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[22:07:54] (CR) AOkoth: [C: +2] vrts: add vrts2001 values and add database port in config [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[22:10:23] (PS1) AOkoth: Revert "vrts: add vrts2001 values and add database port in config" [puppet] - https://gerrit.wikimedia.org/r/868534
[22:13:11] (CR) AOkoth: [C: +2] Revert "vrts: add vrts2001 values and add database port in config" [puppet] - https://gerrit.wikimedia.org/r/868534 (owner: AOkoth)
[22:23:38] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: Jbond)
[22:24:23] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/867550 (owner: Jbond)
[22:27:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:36:12] (CR) Volans: "LGTM for start testing it! Just one possible typo inline" [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[22:40:05] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:40] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (Volans)
[23:20:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[23:23:37] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[23:58:38] jouncebot: now
[23:58:38] For the next 8 hour(s) and 1 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221216T0800)