[00:02:40] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:23:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1091475
[00:38:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1091475 (owner: TrainBranchBot)
[00:40:25] (CR) Scardenasmolinar: [C:+1] Enable AutoModerator on afwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1090968 (https://phabricator.wikimedia.org/T376597) (owner: Kgraessle)
[00:48:54] SRE-OnFire, Incident Tooling: Corto: Scrutinize/finalize template text - https://phabricator.wikimedia.org/T376941#10324995 (Eevans) a:jhathaway
[01:08:48] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1091476
[01:08:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1091476 (owner: TrainBranchBot)
[01:21:07] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1091475 (owner: TrainBranchBot)
[01:27:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:45] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1091476 (owner: TrainBranchBot)
[02:01:48] (CR) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[02:02:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:02:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:07:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:37:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:11] PROBLEM - Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100%
[04:42:18] here
[04:44:04] RECOVERY - Host db1246 #page is UP: PING WARNING - Packet loss = 50%, RTA = 100.00 ms
[04:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:45:19] PROBLEM - mysqld processes on db1246 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:45:19] PROBLEM - MariaDB Replica IO: s2 on db1246 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:46] rzl: here as well if you need more eyes / hands
[04:46:01] swfrench-wmf: I'm just reading back through https://phabricator.wikimedia.org/T374215 for context
[04:46:04] PROBLEM - MariaDB Replica SQL: s2 on db1246 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:46:05] PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:46:29] I was about to depool when it came back, it just finished rebooting and I don't know what kind of state it's in so I'm going to complete the depool
[04:46:55] * swfrench-wmf now realizes why he recognizes this hostname
[04:47:04] SGTM
[04:47:06] !log rzl@cumin2002 dbctl commit (dc=all): 'db1246 depooled', diff saved to https://phabricator.wikimedia.org/P71052 and previous config saved to /var/cache/conftool/dbconfig/20241115-044705-rzl.json
[04:48:18] swfrench-wmf: I'll update the task but I think there's nothing else to do, have a good rest of your evening
[04:49:40] rzl: sounds good - thank you for doing so, and likewise!
[04:49:56] (I'm downtiming it for 36h, also)
[04:50:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:50:48] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 12:00:00 on db1246.eqiad.wmnet with reason: depooled
[04:51:04] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on db1246.eqiad.wmnet with reason: depooled
[04:51:10] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325350 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ea7f6a3a-f936-4754-9253-b01ea562f9bc) set by rzl@cumin2002 for 1 day, 12:00:00 on 1...
[04:51:38] wait, today's Thursday 🤦
[04:54:10] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 12:00:00 on db1246.eqiad.wmnet with reason: depooled
[04:54:15] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 12:00:00 on db1246.eqiad.wmnet with reason: depooled
[04:54:24] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325351 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f799d7b-940e-4ec5-afea-cee60176459f) set by rzl@cumin2002 for 3 days, 12:00:00 on 1...
[04:54:26] there we go, that'll be until Monday morning
[05:00:30] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325353 (RLazarus) This just happened again: we got paged when it stopped answering ping. It rebooted on its own, but I left it depooled (and downtimed until...
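Editor's annotation on the two downtime runs above: a minimal sketch of the end-time arithmetic, assuming the log date is 2024-11-15 UTC (taken from the `20241115-044705-rzl.json` cache filename in the dbctl entry) and that a downtime simply ends at its START time plus the stated duration. The helper name `downtime_end` is hypothetical and is not part of the sre.hosts.downtime cookbook.

```python
from datetime import datetime, timedelta, timezone

# START timestamps copied from the two sre.hosts.downtime log entries.
# The calendar date (2024-11-15, UTC) is an assumption drawn from the
# dbctl cache filename that appears earlier in the log.
first_start = datetime(2024, 11, 15, 4, 50, 48, tzinfo=timezone.utc)
second_start = datetime(2024, 11, 15, 4, 54, 10, tzinfo=timezone.utc)


def downtime_end(start: datetime, duration: timedelta) -> datetime:
    """Hypothetical helper: a downtime ends at start + duration."""
    return start + duration


# "1 day, 12:00:00" (the 36h downtime) and "3 days, 12:00:00".
first_end = downtime_end(first_start, timedelta(days=1, hours=12))
second_end = downtime_end(second_start, timedelta(days=3, hours=12))

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
for label, end in (("36h", first_end), ("3d12h", second_end)):
    print(label, day_names[end.weekday()], end.isoformat())
```

Under those assumptions the 36h downtime would have expired on Saturday in UTC, while the 3 days, 12:00:00 run expires on Monday in UTC, which is consistent with the "until Monday morning" remark for a US-based local time (the operator's timezone is not stated in the log).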
[05:32:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:05:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:49:17] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241115T0700)
[07:12:11] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325441 (ABran-WMF) thanks @RLazarus, will take care of it! @Jclark-ctr no problem, will provide the reports
[07:22:24] (PS1) Brouberol: airflow-analytics: define namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1091527 (https://phabricator.wikimedia.org/T378439)
[07:22:25] (PS1) Brouberol: airflow-search: set kubeconfig owner group to analytics-delployers [puppet] - https://gerrit.wikimedia.org/r/1091523 (https://phabricator.wikimedia.org/T378441)
[07:22:26] (PS1) Brouberol: airflow-analytics: add namespace to tenant list of ceph csi and cloudnative-pg [deployment-charts] - https://gerrit.wikimedia.org/r/1091528 (https://phabricator.wikimedia.org/T378439)
[07:22:26] (PS1) Brouberol: airflow-analytics: define user kubeconfig [puppet] - https://gerrit.wikimedia.org/r/1091524 (https://phabricator.wikimedia.org/T378439)
[07:22:27] (PS1) Brouberol: airflow-analytics: define helmfile and values [deployment-charts] - https://gerrit.wikimedia.org/r/1091529 (https://phabricator.wikimedia.org/T378439)
[07:22:27] (PS1) Brouberol: airflow-analytics: define OIDC config [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439)
[07:22:31] (PS1) Brouberol: airflow-analytics: define ATS mapping and cache config [puppet] - https://gerrit.wikimedia.org/r/1091526 (https://phabricator.wikimedia.org/T378439)
[07:27:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:09] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[07:36:01] (CR) Slyngshede: [C:+1] "LGTM, but how many airflows do you need 😊" [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: Brouberol)
[07:36:15] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[07:36:47] ^ I deployed a new airflow config that restarted the scheduler
[07:37:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:39:27] (CR) Muehlenhoff: [C:+2] Add ml-lab Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1091259 (owner: Muehlenhoff)
[07:42:25] (CR) Brouberol: "We're currently migrating all 8 existing airflow [instances](https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances) " [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: Brouberol)
[07:50:59] (CR) Muehlenhoff: airflow-analytics: define OIDC config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: Brouberol)
[07:55:14] (PS2) Majavah: WMCS: Lookup IPv6 records more generally [puppet] - https://gerrit.wikimedia.org/r/1087951
[07:56:12] (CR) Brouberol: airflow-analytics: define OIDC config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: Brouberol)
[07:57:57] (CR) Majavah: [C:+2] WMCS: Lookup IPv6 records more generally [puppet] - https://gerrit.wikimedia.org/r/1087951 (owner: Majavah)
[07:58:41] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:59:07] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241115T0800)
[08:00:55] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: Brouberol)
[08:04:06] (CR) Majavah: [C:+2] snapshot: Remove labtestwiki from excluded wikis [puppet] - https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: Zabe)
[08:06:10] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325508 (ABran-WMF) >>! In T374215#10324063, @Jclark-ctr wrote: > @ABran-WMF Dell is requesting SOS report and TSR report from this server and another. can y...
[08:08:24] (PS1) Majavah: cirrus: Drop labtestwiki exclude [deployment-charts] - https://gerrit.wikimedia.org/r/1091589 (https://phabricator.wikimedia.org/T378260)
[08:09:06] (PS1) Slyngshede: Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590
[08:10:41] (CR) CI reject: [V:-1] Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590 (owner: Slyngshede)
[08:13:25] (CR) Ayounsi: [C:+1] "Purely aesthetic and not really needed here as there are not a lot of fields, but FYI you can add a `field="XXX"` in the fail()." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[08:14:57] (PS2) Slyngshede: Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590
[08:15:01] (PS2) Ayounsi: Enable validators for IKE/IPsec definitions [puppet] - https://gerrit.wikimedia.org/r/1090914 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[08:15:34] (CR) Ayounsi: [C:+1] Enable validators for IKE/IPsec definitions [puppet] - https://gerrit.wikimedia.org/r/1090914 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[08:16:31] (CR) CI reject: [V:-1] Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590 (owner: Slyngshede)
[08:16:34] (PS1) Muehlenhoff: Remove script which has been obsolete for almost a decade [puppet] - https://gerrit.wikimedia.org/r/1091592
[08:16:38] (CR) Ayounsi: [C:+1] Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[08:17:06] (PS3) Slyngshede: Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590
[08:17:59] (CR) Ayounsi: [C:+1] Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[08:20:59] (PS1) Muehlenhoff: bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091593
[08:22:35] (CR) CI reject: [V:-1] bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091593 (owner: Muehlenhoff)
[08:30:34] (CR) Vgutierrez: trafficserver: explicitly specify user/group for systemd unit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091330 (owner: Ssingh)
[08:31:57] (PS1) Slyngshede: C:ldap::management require member(s) when add LDAP group [puppet] - https://gerrit.wikimedia.org/r/1091594
[08:35:11] (CR) Brouberol: [C:+1] wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) (owner: Ryan Kemper)
[08:36:12] (CR) Elukey: "Hey Keith! A couple of comments:" [puppet] - https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: Herron)
[08:38:04] (Abandoned) Elukey: role::puppetserver: add admin groups config [puppet] - https://gerrit.wikimedia.org/r/1073733 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[08:39:54] (CR) Elukey: Drop Python support for 3.7, 3.8, add 3.11 (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/1029209 (owner: Volans)
[08:40:35] (CR) Muehlenhoff: [C:+1] "Looks good. We could also add a test where a memberless group is being created, maybe?" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590 (owner: Slyngshede)
[08:41:18] (CR) Elukey: [C:+1] Use importlib.metadata instead of pkg_resources [software/cumin] - https://gerrit.wikimedia.org/r/1029210 (owner: Volans)
[08:41:36] (CR) Elukey: [C:+1] Add support for Python 3.12 [software/cumin] - https://gerrit.wikimedia.org/r/1090504 (owner: Volans)
[08:42:01] (CR) Muehlenhoff: [C:+2] Remove script which has been obsolete for almost a decade [puppet] - https://gerrit.wikimedia.org/r/1091592 (owner: Muehlenhoff)
[08:42:38] (CR) Elukey: [C:+1] "Didn't test it but I trust you :)" [software/cumin] - https://gerrit.wikimedia.org/r/1090505 (owner: Volans)
[08:43:14] (CR) Elukey: [C:+1] Move Puppet CA monitoring out of the puppetmaster module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:43:56] (CR) Elukey: [C:+1] puppetserver: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1091245 (owner: Muehlenhoff)
[08:45:17] (Abandoned) Elukey: Revert "sre.hosts.reimage: improve UEFI for Supermicro" [cookbooks] - https://gerrit.wikimedia.org/r/1090907 (owner: JHathaway)
[08:48:42] !log installing Linux 6.1.115 kernel updates from Bookworm point release
[08:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:05] (CR) Michael Große: [C:+1] "Mh, what will be the effect of this change on beta?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: Urbanecm)
[08:55:40] (CR) Urbanecm: "Not much. Beta has the extension1 cluster declared as well, but it points to the same physical database. So, it would work, but we wouldn'" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: Urbanecm)
[08:59:20] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325616 (MatthewVernon) @ABran-WMF is it a case of installing sosreport with apt and then running it? https://packages.debian.org/bookworm/sosreport
[09:00:58] (PS1) Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618)
[09:01:23] (PS2) Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618)
[09:01:57] (CR) Majavah: resolvconf: don't update resolv.conf with 0 nameservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[09:02:45] (CR) Elukey: "Tested and it works nicely. The only doubt that I have is related to IPv6, namely if an internal client (like build2001) connects to the r" [puppet] - https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[09:02:54] (PS1) Muehlenhoff: Deprecate system::role for remaining ServiceOps roles [puppet] - https://gerrit.wikimedia.org/r/1091599
[09:07:40] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325642 (ABran-WMF) oh I was not aware of this package, will take care of it, thanks!
[09:15:06] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Update
[09:17:50] (PS4) Slyngshede: Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590
[09:19:52] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325722 (revi) (Adding SRE and infra-foundations based on tasks at #mail.) For those without vrt-wiki access but have WMF-NDA, you have P710...
[09:19:57] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10325741 (MoritzMuehlenhoff)
[09:20:53] (CR) Elukey: [C:+1] "LGTM! I didn't find any corner case, and the if/elseif block in envoy.pp is complicated but followable/manageable, so +1 to proceed :)" [puppet] - https://gerrit.wikimedia.org/r/1090798 (owner: Muehlenhoff)
[09:21:55] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:22:22] (PS5) Slyngshede: Require members to be provide on group create. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590
[09:22:35] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:23:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:23:22] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:24:45] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325754 (revi) > Additionally, it appears to be routed via Google., which perhaps has never been correct. If I recall correctly, `mx{1001|20...
[09:26:50] (PS1) Elukey: WIP: sre.hosts.provision: skip IPv6 autoconfig disable for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1091601
[09:27:54] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:28:08] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:28:33] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:28:43] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:29:43] ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10325765 (ABran-WMF) thanks @MatthewVernon here is the report @Jclark-ctr https://people.wikimedia.org/~arnaudb/sosreport-db1246-2024-11-15-izvnngv.tar.xz
[09:29:47] (PS2) Elukey: WIP: sre.hosts.provision: skip IPv6 autoconfig disable for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1091601
[09:31:46] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:32:33] RECOVERY - Ensure traffic_server is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:33:25] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325767 (taavi) for some reason the alias generator script thinks the alias is handled by google and does not route it to VRTS: ` Nov 15 09:0...
[09:34:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled
[09:34:22] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[09:34:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled
[09:35:26] (PS1) Slyngshede: Permissions: automatically attempt request validation on creation [software/bitu] - https://gerrit.wikimedia.org/r/1091602
[09:36:06] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325776 (revi) I can definitely say it did not work that way yesterday: there was an incoming info-ko@wikimedia.org ticket at `2024-11-14T02:...
[09:36:57] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:37:58] (CR) CI reject: [V:-1] Permissions: automatically attempt request validation on creation [software/bitu] - https://gerrit.wikimedia.org/r/1091602 (owner: Slyngshede)
[09:39:15] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1091590 (owner: Slyngshede)
[09:40:32] (CR) Muehlenhoff: "Thanks for the review, appreciated! I'll merge on Monday." [puppet] - https://gerrit.wikimedia.org/r/1090798 (owner: Muehlenhoff)
[09:43:39] (PS1) Samtar: Revert "Allow other input and changes to trigger searchsuggestions to update" [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091605
[09:43:53] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325786 (revi) It seems like it's entire `@wikimedia.org` that is refusing to route to VRTS. My test email to `oversight-ko-wp@wikimedia.org`...
[09:43:55] ^ Sounds like more urgent that I thought... :P
[09:44:37] (CR) Samtar: "(just in case we want to do a Friday backport)" [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091605 (owner: Samtar)
[09:47:09] SRE, Infrastructure-Foundations, Mail, vrts, Znuny: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325804 (Krd) That actually make VRTS fubar. Unbreak now please.
[09:51:19] (PS1) Muehlenhoff: Add build2002 [puppet] - https://gerrit.wikimedia.org/r/1091606 (https://phabricator.wikimedia.org/T379343)
[09:58:28] (CR) Muehlenhoff: [C:+2] Add build2002 [puppet] - https://gerrit.wikimedia.org/r/1091606 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[09:58:40] (PS1) Arturo Borrero Gonzalez: openstack: nova_fullstak: deploy from git with IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356)
[09:58:54] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356) (owner: Arturo Borrero Gonzalez)
[09:59:17] (CR) CI reject: [V:-1] openstack: nova_fullstak: deploy from git with IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356) (owner: Arturo Borrero Gonzalez)
[10:01:38] (PS2) Arturo Borrero Gonzalez: openstack: nova_fullstak: deploy from git with IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356)
[10:01:46] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356) (owner: Arturo Borrero Gonzalez)
[10:03:06] (PS1) Vgutierrez: debian: Add 0010-cap-setgid.patch [debs/trafficserver] - https://gerrit.wikimedia.org/r/1091608
[10:07:47] (PS1) Muehlenhoff: Fix wdqs-all Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1091609
[10:09:56] taavi: can we get that ticket fast-tracked? It seems like half of OTRS is broken because @wikimedia.org won't route
[10:11:04] (Amir(one) replied @ -tech so nvm I guess?)
[10:11:18] (03CR) 10Muehlenhoff: [C:03+2] Fix wdqs-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1091609 (owner: 10Muehlenhoff) [10:16:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325873 (10Jelto) [10:18:10] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325882 (10revi) [10:19:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325876 (10Ladsgroup) p:05Triage→03Unbreak! Nothing in recently merged patches of puppet stands out neither anything i... [10:23:05] (03CR) 10Slyngshede: [C:03+2] Require members to be provide on group create. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091590 (owner: 10Slyngshede) [10:24:34] (03Merged) 10jenkins-bot: Require members to be provide on group create. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091590 (owner: 10Slyngshede) [10:28:15] (03PS2) 10Muehlenhoff: bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091593 [10:28:49] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova_fullstak: deploy from git with IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1091607 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [10:29:35] (03CR) 10CI reject: [V:04-1] bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091593 (owner: 10Muehlenhoff) [10:33:47] (03PS3) 10Muehlenhoff: bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091593 [10:34:38] (03CR) 10Fabfur: [C:03+1] debian: Add 0010-cap-setgid.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (owner: 10Vgutierrez) [10:35:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325930 (10Ladsgroup) Asking ITS if anything changed on their side recently. 
[10:36:41] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:36:51] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:36:58] (03CR) 10Slyngshede: [C:03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091593 (owner: 10Muehlenhoff) [10:39:26] (03CR) 10Muehlenhoff: [C:03+2] bitu-ldap: Allow passing a description when creating a group [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1091593 (owner: 10Muehlenhoff) [10:44:08] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325964 (10Ladsgroup) As a fast fix, we can put vrt transport rule before gmail rule to make sure it gets checked first. [10:44:10] 07sre-alert-triage, 10observability: Alert in need of triage: JobUnavailable - https://phabricator.wikimedia.org/T380022 (10LSobanski) 03NEW [10:45:48] (03CR) 10Cathal Mooney: [C:03+1] Remove office policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/1090897 (https://phabricator.wikimedia.org/T379778) (owner: 10Ayounsi) [10:46:59] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10325989 (10Krd) Can't we just check what was changed yesterday, and undo that? 
[10:47:31] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024 (10LSobanski) 03NEW [10:48:01] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10326001 (10LSobanski) Same alert is firing for ml-serve-eqiad and ml-staging-codfw. [10:48:07] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10326002 (10LSobanski) Same alert is firing for ml-serve-eqiad and ml-staging-codfw. [10:48:28] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10326003 (10LSobanski) [10:48:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10326004 (10taavi) >>! In T380009#10325964, @Ladsgroup wrote: > As a fast fix, we can put vrt transport rule before gmail r... [10:50:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10326030 (10elukey) While provisioning I see the following error for the BMC NIC config: ` Error: {'error': {'code': 'Base.v1_10_3.GeneralError', 'Messag... 
[10:50:49] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova_fullstack: fix service parameters and enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091614 (https://phabricator.wikimedia.org/T379356) [10:51:00] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091614 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [10:54:59] (03Abandoned) 10Vgutierrez: debian: Add 0010-cap-setgid.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (owner: 10Vgutierrez) [10:56:52] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova_fullstack: fix service parameters and enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091614 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [10:59:28] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10326046 (10Ladsgroup) My postfix knowledge is not really good but what I mean is this order: ` transport_maps = regexp:/et... [11:00:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10326051 (10Ladsgroup) >>! In T380009#10325989, @Krd wrote: > Can't we just check what was changed yesterday, and undo that... 
[11:05:45] !log homer 'cr*eqiad*' commit 'T377022' [11:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:50] T377022: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022 [11:06:28] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:06:37] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:08:48] (03Restored) 10Vgutierrez: debian: Add 0010-cap-setgid.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (owner: 10Vgutierrez) [11:09:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10326076 (10Clement_Goubert) Thanks a bunch! [11:10:11] (03CR) 10Fabfur: haproxy: add ring support to haproxy configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:11:34] (03PS3) 10Elukey: WIP: sre.hosts.provision: skip IPv6 autoconfig disable for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1091601 [11:12:08] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10326074 (10eoghan) So the issue is coming from the vrts_aliases.py cron job. Something has changed in how gmail is respond... 
[11:12:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:12:21] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:13:51] (03PS2) 10Vgutierrez: debian: Add 0010-cap-setgid.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 [11:13:52] (03PS13) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [11:14:34] (03PS4) 10Elukey: WIP: sre.hosts.provision: skip IPv6 autoconfig disable for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1091601 [11:14:56] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:15:38] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:19:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:19:06] !log homer 'lsw1-e5-eqiad*' commit 'T377022' [11:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] T377022: wikikube-worker13[05-12] implementation tracking - https://phabricator.wikimedia.org/T377022 [11:20:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10326100 (10elukey) @Papaul @Jhancock.wm we'd need to upgrade the firmware on this node, I think that we could use directly [[ https://www.supermicro.com/... 
[11:20:39] !log homer 'lsw1-e6-eqiad*' commit 'T377022' [11:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:48] !log homer 'lsw1-e7-eqiad*' commit 'T377022' [11:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:21:33] !log homer 'lsw1-f7-eqiad*' commit 'T377022' [11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:41] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:22:02] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:22:16] !log homer 'lsw1-f6-eqiad*' commit 'T377022' [11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:21] (03Abandoned) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [11:22:53] !log homer 'lsw1-f5-eqiad*' commit 'T377022' [11:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:07] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1305-1312].eqiad.wmnet [11:24:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1305-1312].eqiad.wmnet [11:27:12] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:36:12] (03PS1) 10EoghanGaffney: mx: Update vrts_aliases script not to check gmail for wm.o addresses [puppet] - 10https://gerrit.wikimedia.org/r/1091628 
(https://phabricator.wikimedia.org/T380009) [11:37:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:34] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@2c533d6]: hotfix image suggestions weekly snapshots [11:38:04] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@2c533d6]: hotfix image suggestions weekly snapshots (duration: 00m 57s) [11:38:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10326161 (10elukey) @Jclark-ctr Hello! For this host, we have to follow a new workflow: 1) The provision cookbook needs to be run with `--uefi`, since ot... [11:39:23] (03CR) 10LSobanski: [C:03+1] mx: Update vrts_aliases script not to check gmail for wm.o addresses [puppet] - 10https://gerrit.wikimedia.org/r/1091628 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [11:40:26] (03CR) 10EoghanGaffney: [C:03+2] mx: Update vrts_aliases script not to check gmail for wm.o addresses [puppet] - 10https://gerrit.wikimedia.org/r/1091628 (https://phabricator.wikimedia.org/T380009) (owner: 10EoghanGaffney) [11:42:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:50] (03CR) 10Cathal Mooney: [C:03+2] Add validators for Netbox IPsec elements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [11:53:57] (03PS1) 10Arturo Borrero Gonzalez: openstack: horizon: enable IPv6 on neutron panels 
[puppet] - 10https://gerrit.wikimedia.org/r/1091632 (https://phabricator.wikimedia.org/T377339) [11:55:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: horizon: enable IPv6 on neutron panels [puppet] - 10https://gerrit.wikimedia.org/r/1091632 (https://phabricator.wikimedia.org/T377339) [11:56:15] (03Merged) 10jenkins-bot: Add validators for Netbox IPsec elements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [11:56:39] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10326236 (10eoghan) p:05Unbreak!→03High We've made a change to the aliases routing script which we believe has fixed th... [11:58:42] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:59:02] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091632 (https://phabricator.wikimedia.org/T377339) (owner: 10Arturo Borrero Gonzalez) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241115T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241115T1200). 
[12:00:15] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:01:00] PROBLEM - Disk space on Hadoop worker on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:01:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [12:01:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [12:01:37] aokoth@cumin1002 aokoth: The backup on gitlab2002 is complete, ready to proceed with upgrade. [12:01:44] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:03:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host build2002.codfw.wmnet [12:03:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:04:16] (03CR) 10Ladsgroup: [C:03+1] cirrus: Drop labtestwiki exclude [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091589 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [12:06:30] (03CR) 10Btullis: [C:03+1] airflow-analytics: define user kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1091524 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:06:47] (03CR) 10Btullis: [C:03+1] airflow-analytics: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:07:00] (03CR) 10Btullis: [C:03+1] airflow-analytics: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091526 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:07:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:28] (03CR) 10Btullis: [C:03+1] airflow-search: set kubeconfig owner group to analytics-delployers [puppet] - 10https://gerrit.wikimedia.org/r/1091523 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [12:07:33] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2308 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:07:38] (03CR) 10Brouberol: airflow-analytics: define OIDC config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:07:46] (03CR) 10Brouberol: [C:03+2] airflow-search: set kubeconfig owner group to analytics-delployers [puppet] - 10https://gerrit.wikimedia.org/r/1091523 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [12:07:46] (03CR) 10Btullis: [C:03+1] airflow-analytics: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091527 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:07:48] (03CR) 10Brouberol: [C:03+2] airflow-analytics: define user kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1091524 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:08:00] (03CR) 10Btullis: [C:03+1] airflow-analytics: add namespace to tenant list of ceph csi and cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091528 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:08:16] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Update [12:08:35] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 107028 bytes in 1.260 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:08:39] (03CR) 10Btullis: [C:03+1] airflow-analytics: 
define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091529 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:08:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:16] (03CR) 10Brouberol: [C:03+2] airflow-analytics: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091525 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:11:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2002.codfw.wmnet - jmm@cumin2002" [12:11:28] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox [12:12:14] (03CR) 10Brouberol: [C:03+1] Enable deletion of unused segments on the druid-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1090842 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [12:12:25] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:09] (03CR) 10Brouberol: [C:03+2] airflow-analytics: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091527 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:13:12] (03CR) 10Brouberol: [C:03+2] airflow-analytics: add namespace to tenant list of ceph csi and cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091528 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:13:14] (03CR) 
10Brouberol: [C:03+2] airflow-analytics: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091529 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:13:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM build2002.codfw.wmnet - jmm@cumin2002" [12:15:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:15:29] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache build2002.codfw.wmnet on all recursors [12:15:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) build2002.codfw.wmnet on all recursors [12:15:58] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM build2002.codfw.wmnet - jmm@cumin2002" [12:16:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM build2002.codfw.wmnet - jmm@cumin2002" [12:16:40] (03Merged) 10jenkins-bot: airflow-analytics: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091527 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:16:42] (03Merged) 10jenkins-bot: airflow-analytics: add namespace to tenant list of ceph csi and cloudnative-pg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091528 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:16:52] (03Merged) 10jenkins-bot: airflow-analytics: define helmfile 
and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091529 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:17:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:17:25] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host build2002.codfw.wmnet with OS bookworm [12:17:36] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10326333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm [12:18:14] (03CR) 10Cathal Mooney: [C:03+2] Enable validators for IKE/IPsec definitions [puppet] - 10https://gerrit.wikimedia.org/r/1090914 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [12:18:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:19:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [12:24:00] (03PS1) 10Brouberol: airflow-analytics: fix typo in db username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091642 (https://phabricator.wikimedia.org/T378439) [12:25:28] (03CR) 10Brouberol: [C:03+2] airflow-analytics: fix typo in db username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091642 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:26:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [12:27:07] (03CR) 10Brouberol: [C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1091526 (https://phabricator.wikimedia.org/T378439) (owner: 10Brouberol) [12:27:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [12:32:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [12:36:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [12:37:25] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:27] (03CR) 10Ayounsi: [C:03+2] Remove office policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/1090897 (https://phabricator.wikimedia.org/T379778) (owner: 10Ayounsi) [12:38:00] (03Merged) 10jenkins-bot: Remove office policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/1090897 (https://phabricator.wikimedia.org/T379778) (owner: 10Ayounsi) [12:40:44] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] openstack: horizon: enable IPv6 on neutron panels [puppet] - 
10https://gerrit.wikimedia.org/r/1091632 (https://phabricator.wikimedia.org/T377339) (owner: 10Arturo Borrero Gonzalez) [12:41:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: horizon: enable IPv6 on neutron panels [puppet] - 10https://gerrit.wikimedia.org/r/1091632 (https://phabricator.wikimedia.org/T377339) (owner: 10Arturo Borrero Gonzalez) [12:46:09] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10326497 (10aborrero) [12:48:48] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10326506 (10aborrero) 05Stalled→03Resolved a:03aborrero this was done by means of {T378192} [12:51:53] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10326522 (10aborrero) 05In progress→03Resolved [12:52:19] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10326524 (10aborrero) [12:53:57] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10326527 (10aborrero) 05Open→03Resolved a:03aborrero I think we can consider IPv6 to be fully working on codfw1dev. 
[12:53:58] (03PS1) 10Brouberol: airflow: define the webserver.base_url configuration [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) [12:54:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2002.codfw.wmnet with OS bookworm [12:54:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host build2002.codfw.wmnet [12:54:16] (03CR) 10Alexandros Kosiaris: [C:04-1] "minor nitpick about the * not being needed and a question." [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [12:54:54] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10326537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm completed: - build2002 (**PASS**) - Removed from Pupp... [12:57:32] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4532/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) (owner: 10Brouberol) [13:00:12] (03CR) 10Cathal Mooney: [C:03+2] Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [13:01:22] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#10326578 (10ayounsi) 05Stalled→03Declined Going to close that task as we're not planning on using gNMI for automation any further, due to various shortcom... 
[13:01:33] (03PS2) 10Brouberol: airflow: define the webserver.base_url configuration [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) [13:01:39] !log imported 8u432-b06-2~deb12u1 to component/jdk8 for bookworm (forward port of the latest Java 8 security fixes for Bookworm) [13:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:26] 10SRE-tools, 06Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045#10326590 (10ayounsi) 05Open→03Declined Thanks for dictdiffer, because of a change in priorities and current limitations in pyGNMI, there is no more need to packag... [13:03:48] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638#10326595 (10ayounsi) 05Stalled→03Declined Because of the various limitations listed in {T340045} we're not going to proceed any further on Del... [13:04:52] 06SRE, 06Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028#10326604 (10ayounsi) 05Stalled→03Declined Because of the various limitations listed in {T342673} (plus the ones from pygnmi) we're not going to proceed any further on Dell... 
[13:04:56] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4533/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) (owner: 10Brouberol) [13:06:20] (03PS1) 10Muehlenhoff: java: Update image version tag to match the java bookworm update [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1091669 [13:17:08] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Update homer wmf-plugin to export Netbox ipsec data - cmooney@cumin1002 [13:19:02] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Update homer wmf-plugin to export Netbox ipsec data - cmooney@cumin1002 [13:21:46] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Update homer wmf-plugin to export Netbox ipsec data - cmooney@cumin1002 [13:22:39] 10ops-eqiad, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050 (10ayounsi) 03NEW [13:23:48] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest1004.eqiad.wmnet [13:24:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Update homer wmf-plugin to export Netbox ipsec data - cmooney@cumin1002 [13:26:00] (03PS1) 10Brouberol: airflow-platform-eng: define kube namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091697 (https://phabricator.wikimedia.org/T378443) [13:26:00] (03PS1) 10Brouberol: airflow-platform-eng: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091694 (https://phabricator.wikimedia.org/T378443) [13:26:02] (03PS1) 10Brouberol: 
airflow-platform-eng: register kube namespace in ceph csi / cloudnative pg tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091698 (https://phabricator.wikimedia.org/T378443) [13:26:02] (03PS1) 10Brouberol: airflow-platform-eng: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091695 (https://phabricator.wikimedia.org/T378443) [13:26:04] (03PS1) 10Brouberol: airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) [13:26:04] (03PS1) 10Brouberol: airflow-platform-eng: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1091696 (https://phabricator.wikimedia.org/T378443) [13:27:26] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:28:40] (03CR) 10Cathal Mooney: [C:03+2] Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [13:29:13] (03Merged) 10jenkins-bot: Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [13:29:23] (03CR) 10Vgutierrez: docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [13:31:02] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:31:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" 
[13:31:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:20] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest1004.eqiad.wmnet [13:31:29] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10326695 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `sretest1004.eqiad.wmnet` - sretest1004.eqiad.wmnet (**FAIL**) - //Host not found on Icinga, u... [13:31:59] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10326691 (10ayounsi) [13:33:59] (03PS1) 10Ayounsi: Remove v6 include for e8/f8 uplinks [dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) [13:35:14] (03CR) 10CI reject: [V:04-1] Remove v6 include for e8/f8 uplinks [dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) (owner: 10Ayounsi) [13:35:15] (03CR) 10Brouberol: [C:03+1] java: Update image version tag to match the java bookworm update [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1091669 (owner: 10Muehlenhoff) [13:36:00] (03CR) 10Brouberol: "Almost done! Only one after this one" [puppet] - 10https://gerrit.wikimedia.org/r/1091695 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:37:08] (03PS1) 10Ayounsi: Disable SSH password auth on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/1091725 (https://phabricator.wikimedia.org/T379464) [13:41:57] !log test no-passwords on mr1-eqsin - T379464 [13:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:20] (03PS1) 10Muehlenhoff: Remove obsolete package::builder role [puppet] - 10https://gerrit.wikimedia.org/r/1091729 [13:46:22] (03CR) 10Ayounsi: "CI needs to be re-ran once the IPs have been deleted from Netbox." 
[dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) (owner: 10Ayounsi) [13:46:57] (03CR) 10DCausse: "lgtm, left a question but please feel to go ahead (might be me not fully understanding the context here)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:47:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091729 (owner: 10Muehlenhoff) [13:47:25] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] java: Update image version tag to match the java bookworm update [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1091669 (owner: 10Muehlenhoff) [13:47:53] (03PS2) 10Brouberol: airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) [13:49:26] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091694 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:49:44] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091695 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:50:07] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1091696 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:50:27] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: define kube namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091697 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:50:44] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: register kube namespace in ceph csi / cloudnative pg tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091698 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) 
[13:52:06] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:53:27] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091694 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:53:41] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091695 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:54:00] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1091696 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:54:17] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: define kube namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091697 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:54:31] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: register kube namespace in ceph csi / cloudnative pg tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091698 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:54:47] (03CR) 10Stevemunene: airflow-platform-eng: define helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:55:00] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:55:03] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:55:19] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:55:42] (03CR) 10Btullis: airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:56:19] (03CR) 10Brouberol: airflow-platform-eng: define 
helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [13:59:13] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove e8 lo0 IP - ayounsi@cumin1002" [13:59:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove e8 lo0 IP - ayounsi@cumin1002" [13:59:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:45] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [14:02:56] (03PS3) 10Brouberol: airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) [14:04:54] 07Puppet: Keepalived Puppet module: Support IPv6 - https://phabricator.wikimedia.org/T380057 (10taavi) 03NEW [14:06:40] (03PS1) 10Jcrespo: backup: Move Dell bacula hosts to mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/1091731 (https://phabricator.wikimedia.org/T376892) [14:10:33] (03CR) 10Jcrespo: [C:03+2] backup: Move Dell bacula hosts to mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/1091731 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo) [14:11:31] (03CR) 10Stevemunene: [C:03+1] airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [14:14:15] (03PS3) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [14:14:15] (03PS2) 10Majavah: dynamicproxy: Provision AAAA 
records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [14:14:15] (03PS1) 10Majavah: keepalived: Split failover config template to new class [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) [14:14:16] (03PS1) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T379175) [14:15:11] (03PS2) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) [14:15:14] (03PS4) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [14:15:14] (03PS3) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [14:15:21] (03CR) 10CI reject: [V:04-1] dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [14:16:54] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4534/console" [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [14:17:12] (03PS1) 10Muehlenhoff: Add two new Airflow LDAP groups to be considered for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1091735 (https://phabricator.wikimedia.org/T375729) [14:20:20] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [14:22:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4538/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [14:25:27] (03PS3) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) [14:25:27] (03PS5) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [14:25:27] (03PS4) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [14:26:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4539/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) (owner: 10Majavah) [14:32:12] 07Puppet, 07IPv6, 13Patch-For-Review: Keepalived Puppet module: Support IPv6 - https://phabricator.wikimedia.org/T380057#10326995 (10taavi) [14:32:58] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10326996 (10MoritzMuehlenhoff) [14:34:17] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10326998 (10RobH) > "Created by: cbissell Cross connect has been disconnected, per our policy it will be removed after 48 hours" I'll... 
[14:35:15] (03PS3) 10Vgutierrez: debian: Add 0010-cap-setgid.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 [14:35:41] (03PS4) 10Vgutierrez: debian: Add 0010-initgroups.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 [14:39:38] (03CR) 10Muehlenhoff: [C:03+1] "Maybe add a reference to https://github.com/apache/trafficserver/commit/ae638096e259121d92d46a9f57026a5ff5bc328b ?" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (owner: 10Vgutierrez) [14:42:22] (03CR) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [14:47:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10327025 (10MoritzMuehlenhoff) [14:53:25] (03CR) 10Herron: [V:03+1] "Hey! Thanks for having a look, this is what I was thinking step-wise https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088610/comment" [puppet] - 10https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [14:54:15] (03CR) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [14:56:07] (03CR) 10Ssingh: [C:03+1] "Hi folks: can we merge this? We have some upcoming magru maintenance coming up so we want this to be in place. Much thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite) [15:03:06] (03PS2) 10Slyngshede: Permissions: automatically attempt request validation on creation [software/bitu] - 10https://gerrit.wikimedia.org/r/1091602 [15:07:12] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4043 is OK: HTTP OK: HTTP/1.0 200 OK - 36327 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:07:12] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is OK: HTTP OK: HTTP/1.1 200 OK - 48606 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:14:10] (03CR) 10Clément Goubert: [C:03+1] Deprecate system::role for remaining ServiceOps roles [puppet] - 10https://gerrit.wikimedia.org/r/1091599 (owner: 10Muehlenhoff) [15:15:59] 06SRE-OnFire, 10Incident Tooling: Corto: Scrutinize/finalize template text - https://phabricator.wikimedia.org/T376941#10327067 (10jhathaway) 05Open→03Resolved Updated template to more closely match the original, https://gitlab.wikimedia.org/repos/sre/corto/-/merge_requests/28 [15:16:58] (03PS5) 10Vgutierrez: debian: Add 0010-initgroups.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (https://phabricator.wikimedia.org/T379797) [15:18:50] (03CR) 10CDanis: [C:03+1] role::aux_k8s::worker: add role to 2 new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:18:54] (03CR) 10CDanis: [C:03+1] aux_k8s: enable new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:19:09] (03CR) 10Vgutierrez: "I've referenced the PR and reported the issue on https://github.com/apache/trafficserver/issues/11869" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 
(https://phabricator.wikimedia.org/T379797) (owner: 10Vgutierrez) [15:19:12] (03CR) 10Ssingh: [C:03+1] "LGTM!" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (https://phabricator.wikimedia.org/T379797) (owner: 10Vgutierrez) [15:20:49] (03CR) 10Vgutierrez: [C:03+2] debian: Add 0010-initgroups.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1091608 (https://phabricator.wikimedia.org/T379797) (owner: 10Vgutierrez) [15:21:26] (03CR) 10Herron: [V:03+1 C:03+2] role::aux_k8s::worker: add role to 2 new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:21:46] (03CR) 10Herron: [V:03+1 C:03+2] aux_k8s: enable new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [15:25:49] (03CR) 10Brouberol: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1091735 (https://phabricator.wikimedia.org/T375729) (owner: 10Muehlenhoff) [15:26:41] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091694 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:26:51] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091695 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:29:53] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: define kube namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091697 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:29:57] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: register kube namespace in ceph csi / cloudnative pg tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091698 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:29:59] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: define helmfile and values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:33:26] (03Merged) 10jenkins-bot: airflow-platform-eng: define kube namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091697 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:33:35] (03Merged) 10jenkins-bot: airflow-platform-eng: register kube namespace in ceph csi / cloudnative pg tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091698 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:33:36] (03Merged) 10jenkins-bot: airflow-platform-eng: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091699 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:34:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:35:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:37:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:37:35] (03PS1) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 [15:38:00] 06SRE-OnFire, 10Incident Tooling: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#10327129 (10Eevans) [15:38:01] 06SRE-OnFire, 10Incident Tooling: Corto internal incident response workflow automation (MVP) - https://phabricator.wikimedia.org/T356790#10327130 (10Eevans) [15:38:02] 06SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467#10327131 (10Eevans) [15:38:10] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [15:38:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:39:02] (03CR) 10Brouberol: [C:03+2] "`" [puppet] 
- 10https://gerrit.wikimedia.org/r/1091696 (https://phabricator.wikimedia.org/T378443) (owner: 10Brouberol) [15:39:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [15:39:15] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4540/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [15:39:23] (03PS1) 10Bernard Wang: Reenable non-UI experiement quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 [15:39:47] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [15:40:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [15:41:52] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [15:42:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [15:43:24] (03CR) 10Brouberol: [C:03+1] "Sorry, my previous response was lacking a vote" [puppet] - 10https://gerrit.wikimedia.org/r/1091735 (https://phabricator.wikimedia.org/T375729) (owner: 10Muehlenhoff) [15:47:04] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10327189 (10cmooney) [15:47:05] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10327190 (10cmooney) [15:47:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: testing stuff on test-s4 [15:47:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on 
db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: testing stuff on test-s4 [15:53:38] (03PS2) 10Aleksandar Mastilovic: Rename to Blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 [15:54:43] (03CR) 10CI reject: [V:04-1] Rename to Blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 (owner: 10Aleksandar Mastilovic) [15:57:50] 06SRE, 10iPoid-Service, 13Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10327243 (10akosiaris) 05Open→03Resolved Resolving, feel free to reopen. [16:00:24] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.6-1wm2_amd64.changes: T379797 [16:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:41] T379797: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797 [16:02:18] (03PS1) 10FNegri: Allow pty allocation for cumin ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) [16:03:19] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4051*} and A:cp for 9.2.6-1wm2 [16:05:34] (03PS1) 10Bking: dse-k8s-service: fix name of net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1091756 (https://phabricator.wikimedia.org/T371994) [16:06:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4051*} and A:cp for 9.2.6-1wm2 [16:07:47] (03CR) 10Btullis: [C:03+1] airflow: define the webserver.base_url configuration [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) (owner: 10Brouberol) [16:08:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for cp4043.ulsfo.wmnet [16:08:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp4043.ulsfo.wmnet [16:09:20] !log 
sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet [reason: ATS fixed] [16:09:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091756 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [16:12:38] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10327369 (10RobH) They confirmed receipt of the message/ticket but did not confirm the date/time specifically so I've sent a followup just now: > Suppo... [16:16:41] (03CR) 10Brouberol: [C:03+1] dse-k8s-service: fix name of net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1091756 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [16:22:28] !log herron@cumin2002 conftool action : set/weight=10; selector: name=aux-k8s-worker1004.eqiad.wmnet,cluster=aux-k8s,service=kubesvc [16:22:43] !log herron@cumin2002 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1004.eqiad.wmnet,cluster=aux-k8s,service=kubesvc [16:27:07] (03CR) 10Bking: [C:03+2] dse-k8s-service: fix name of net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1091756 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [16:27:26] !log herron@cumin2002 conftool action : set/weight=10; selector: name=aux-k8s-worker1005.eqiad.wmnet,cluster=aux-k8s,service=kubesvc [16:27:37] !log herron@cumin2002 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1005.eqiad.wmnet,cluster=aux-k8s,service=kubesvc [16:32:21] (03PS1) 10Majavah: P:toolforge: Drop Cloud VPS submission special case [puppet] - 10https://gerrit.wikimedia.org/r/1091762 [16:32:47] (03PS1) 10Andrew Bogott: Neutron: allow observer user to get IP availability [puppet] - 10https://gerrit.wikimedia.org/r/1091763 (https://phabricator.wikimedia.org/T380069) [16:32:49] (03PS1) 10Andrew Bogott: nova policy: allow public access to os-hypervisors GET endpoints [puppet] - 
10https://gerrit.wikimedia.org/r/1091764 (https://phabricator.wikimedia.org/T380069) [16:35:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091763 (https://phabricator.wikimedia.org/T380069) (owner: 10Andrew Bogott) [16:35:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091764 (https://phabricator.wikimedia.org/T380069) (owner: 10Andrew Bogott) [16:41:55] (03CR) 10Andrew Bogott: [C:03+2] nova policy: allow public access to os-hypervisors GET endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1091764 (https://phabricator.wikimedia.org/T380069) (owner: 10Andrew Bogott) [16:42:01] (03CR) 10Andrew Bogott: [C:03+2] Neutron: allow observer user to get IP availability [puppet] - 10https://gerrit.wikimedia.org/r/1091763 (https://phabricator.wikimedia.org/T380069) (owner: 10Andrew Bogott) [16:43:45] (03CR) 10Bking: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 (owner: 10Aleksandar Mastilovic) [16:54:01] RECOVERY - Disk space on Hadoop worker on an-worker1120 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:55:21] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:55:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:56:46] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@82083c4]: image suggestions hotfix - section titles denylist dependency [16:57:07] (03PS1) 10Andrew Bogott: wmcs-image-create: specify a cloud config to use [puppet] - 10https://gerrit.wikimedia.org/r/1091771 [16:57:13] !log copy 
python3-flask-{keystone,oslolog} from bullseye-wikimedia to bookworm-wikimedia [16:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:04] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@82083c4]: image suggestions hotfix - section titles denylist dependency (duration: 01m 58s) [17:01:28] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10327625 (10Eevans) Ok, the bot's registration is now associated with sre-onfire@ {F57705416} Done! [17:02:04] (03CR) 10Andrew Bogott: [C:03+2] wmcs-image-create: specify a cloud config to use [puppet] - 10https://gerrit.wikimedia.org/r/1091771 (owner: 10Andrew Bogott) [17:05:37] (03PS1) 10Majavah: hieradata: Add bastion-codfw1dev-04 to cumin_masters too [puppet] - 10https://gerrit.wikimedia.org/r/1091772 [17:06:36] (03CR) 10Majavah: [C:03+2] hieradata: Add bastion-codfw1dev-04 to cumin_masters too [puppet] - 10https://gerrit.wikimedia.org/r/1091772 (owner: 10Majavah) [17:12:05] (03PS1) 10Andrew Bogott: hieradata: Update bastion-codfw1dev-03 IP [puppet] - 10https://gerrit.wikimedia.org/r/1091775 [17:13:50] (03CR) 10Andrew Bogott: [C:03+2] hieradata: Update bastion-codfw1dev-03 IP [puppet] - 10https://gerrit.wikimedia.org/r/1091775 (owner: 10Andrew Bogott) [17:14:05] (03PS1) 10Bking: admin-ng: replace hdfs-synchronizer namespace w/blunderbuss (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091776 (https://phabricator.wikimedia.org/T371994) [17:16:52] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10327697 (10Eevans) 05Open→03Resolved a:03Eevans [17:21:17] (03CR) 10Bernard Wang: [C:04-1] "not yet ready to deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (owner: 10Bernard Wang) [17:23:28] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - 
https://phabricator.wikimedia.org/T376737#10327708 (10RobH) Window confirmed! > Comment generated in Smart Hands: Hi Robert, good afternoon; > We can assist you at these times. [17:45:53] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083 (10RobH) 03NEW [17:46:01] (03CR) 10Brouberol: [C:03+1] admin-ng: replace hdfs-synchronizer namespace w/blunderbuss (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091776 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [17:46:58] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10327817 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers... [17:47:30] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10327840 (10RobH) [17:52:41] (03CR) 10Brouberol: [C:04-1] airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [17:54:14] (03CR) 10Brouberol: [C:04-1] airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [17:56:37] (03PS1) 10Jasmine: wikikube: put kubestage2003 and 2004 into production [puppet] - 10https://gerrit.wikimedia.org/r/1091783 [17:57:58] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091784 (https://phabricator.wikimedia.org/T369350) [18:00:24] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: 
Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091785 (https://phabricator.wikimedia.org/T369350) [18:03:11] (03PS1) 10Ahmon Dancy: role::beta::deploymentserver: Ensure jenkins-deploy account is member of docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [18:05:33] (03CR) 10CI reject: [V:04-1] role::beta::deploymentserver: Ensure jenkins-deploy account is member of docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [18:06:55] (03PS2) 10Ahmon Dancy: role::beta::deploymentserver: Add jenkins-deploy to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [18:08:29] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:09:03] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:12:51] (03PS2) 10Jasmine: wikikube-staging: put kubestage2003 and 2004 into production [puppet] - 10https://gerrit.wikimedia.org/r/1091783 (https://phabricator.wikimedia.org/T377011) [18:14:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:15:01] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:15:14] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:18:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:18:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10327900 (10elukey) I was able to upload the firmware via Web UI, but the issue seems still present (new version, `01.04.08`. Need to investigate more wha... [18:19:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:25:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:26:46] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091784 (https://phabricator.wikimedia.org/T369350) (owner: 10Clare Ming) [18:26:50] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091785 (https://phabricator.wikimedia.org/T369350) (owner: 10Clare Ming) [18:28:05] (03Merged) 
10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091784 (https://phabricator.wikimedia.org/T369350) (owner: 10Clare Ming) [18:28:10] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091785 (https://phabricator.wikimedia.org/T369350) (owner: 10Clare Ming) [18:29:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 3.626 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:30:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:31:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:31:54] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:32:12] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mpic-next: apply [18:32:44] (03PS3) 10Ebernhardson: WIP: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [18:34:59] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:35:13] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:39:53] (03PS1) 10Majavah: dynamicproxy: Fix Lua path on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1091796 (https://phabricator.wikimedia.org/T379175) [18:41:50] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4542/console" [puppet] - 10https://gerrit.wikimedia.org/r/1091796 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [18:42:16] (03CR) 10Majavah: [V:03+1 C:03+2] dynamicproxy: Fix Lua path on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1091796 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [18:43:25] (03Abandoned) 10Majavah: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - 10https://gerrit.wikimedia.org/r/266448 (owner: 10Tim Landscheidt) [18:43:28] (03Abandoned) 10Majavah: Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - 10https://gerrit.wikimedia.org/r/268279 (owner: 10Tim Landscheidt) [18:43:30] (03Abandoned) 10Majavah: Tools: Decommission proxylistener [puppet] - 10https://gerrit.wikimedia.org/r/268346 (owner: 10Tim Landscheidt) [18:43:32] (03Abandoned) 10Majavah: Tools: Remove obsolete code [puppet] - 10https://gerrit.wikimedia.org/r/268347 (owner: 10Tim Landscheidt) [18:44:08] (03CR) 10Ahmon Dancy: "Tested on deployment-deploy04.deployment-prep via cherry pick on deployment-puppetserver-1.deployment-prep." 
[puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [18:44:57] (03PS3) 10Ahmon Dancy: role::beta::deploymentserver: Add jenkins-deploy to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [18:46:03] (03CR) 10Cwhite: [C:03+2] sre: enable all DCs to complain about Puppet issues [alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite) [18:46:36] (03CR) 10Bking: [C:03+2] admin-ng: replace hdfs-synchronizer namespace w/blunderbuss (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091776 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:47:37] (03PS1) 10Majavah: hieradata: Update codfw1dev cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/1091798 (https://phabricator.wikimedia.org/T379175) [18:49:08] 06SRE, 10vm-requests, 07Kubernetes: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10328009 (10herron) 05Open→03Resolved I think we're good here! ` NAME STATUS ROLES AGE VERSION aux-k8s-ctrl1002.eqiad.wmnet Ready control... [18:49:15] 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10328012 (10herron) [18:50:22] (03Merged) 10jenkins-bot: admin-ng: replace hdfs-synchronizer namespace w/blunderbuss (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091776 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:50:53] thank you cwhite! [18:51:58] <3 [18:52:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:53:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:55:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10328030 (10Jclark-ctr) [19:03:38] (03PS1) 10Bking: dse-k8s-services: add net-new chart for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091801 (https://phabricator.wikimedia.org/T371994) [19:12:55] (03PS2) 10Majavah: keepalived: Split failover config template to new class [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) [19:12:55] (03PS4) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) [19:12:55] (03PS6) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [19:12:56] (03PS5) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [19:12:57] (03PS1) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [19:16:55] (03PS2) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [19:16:55] (03PS7) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [19:16:55] (03PS6) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [19:29:41] (03PS3) 10Aleksandar Mastilovic: Rename to Blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 [19:30:42] (03PS3) 10Scott French: Add JobQueueLowTrafficRuleWidespreadHighLatency [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) 
[19:30:43] (03CR) 10CI reject: [V:04-1] Rename to Blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 (owner: 10Aleksandar Mastilovic) [19:40:31] 10SRE-Access-Requests: Access to Data Hub - https://phabricator.wikimedia.org/T380091 (10IAckerman-WMF) 03NEW [19:40:53] (03CR) 10Bking: "Self-merging per Slack conversation with @amastilovic@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091801 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [19:40:57] (03CR) 10Bking: [C:03+2] dse-k8s-services: add net-new chart for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091801 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [19:41:59] (03Merged) 10jenkins-bot: dse-k8s-services: add net-new chart for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091801 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [19:45:26] (03PS3) 10Majavah: keepalived: Split failover config template to new class [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) [19:45:26] (03PS5) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 (https://phabricator.wikimedia.org/T380057) [19:45:26] (03PS3) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [19:45:27] (03PS8) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [19:45:28] (03PS7) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [19:46:11] !log dancy@deploy2002 Installing scap version "4.124.0" for 206 hosts [19:47:49] 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q1-Q2), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - 
https://phabricator.wikimedia.org/T379159#10328159 (10taavi) [19:47:56] (03PS4) 10Scott French: Add JobQueueLowTrafficConsumerWidespreadHighLatency [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) [19:50:32] !log dancy@deploy2002 Installation of scap version "4.124.0" completed for 206 hosts [19:50:55] (03CR) 10Majavah: "fwiw, the prometheus alert `keep_firing_for` option would have been a bit cleaner way to implement this" [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [19:51:31] !log dancy@deploy2002 Started scap sync-world: Testing T377883 [19:51:34] T377883: Scap prometheus migration: Reduce the cardinality of scap timers/statsd metrics - https://phabricator.wikimedia.org/T377883 [19:52:25] (03CR) 10Scott French: "Thanks for the discussion, Reuven. If this seems reasonable to you, I'd like to start putting miles on this (not yet notifying), while con" [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [19:54:37] !log dancy@deploy2002 Finished scap sync-world: Testing T377883 (duration: 03m 06s) [19:55:35] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:56:45] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations, 07IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139#10328185 (10taavi) >>! In T271139#10151973, @Volans wrote: > I guess that the clouddb are expected and they **all** don't have the AAAA rec... 
[19:56:53] 06SRE, 10Observability-Alerting, 06Traffic: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10328186 (10ssingh) I will test this on Monday and then mark this as resolved but I think this should be fixed. Much thanks to @colewhite for the... [20:02:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:04:43] 06SRE-OnFire, 10Incident Tooling: corto: binary doesn't include build information - https://phabricator.wikimedia.org/T379958#10328210 (10Eevans) `lang=sh-session eevans@alert1002:~$ sudo journalctl -u corto.service [ ... ] Nov 15 20:01:32 alert1002 systemd[1]: Started corto.service - Assist SREs during incide... 
[20:05:26] (03PS1) 10Albertoleoncio: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) [20:06:08] (03CR) 10CI reject: [V:04-1] [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [20:10:21] (03PS2) 10Albertoleoncio: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) [20:14:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:14:27] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:42] 06SRE-OnFire, 10Incident Tooling: corto: binary doesn't include build information - https://phabricator.wikimedia.org/T379958#10328237 (10Eevans) 05Open→03Resolved a:03Eevans Not yet deployed, but //version// is now also a recognized command for the bot: {F57705797} See: [[ https://gitlab.wikimedia... [20:15:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [20:21:28] (03CR) 10RLazarus: [C:03+1] "Nice, let's see how it works in practice." 
[alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [20:27:12] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:27:23] (03PS4) 10Ebernhardson: WIP: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [20:31:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding restbase2036 to codfw - jhancock@cumin2002" [20:31:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding restbase2036 to codfw - jhancock@cumin2002" [20:31:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:32:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2036.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:32:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2037.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:32:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:03] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:03] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:40:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2037.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:40:58] !log jhancock@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host restbase2037 [20:41:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host restbase2037 [20:42:28] (03CR) 10Scott French: "Thanks, Reuven! I'll go ahead and merge this since the combination of team and severity is non-notifying." [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [20:42:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2036.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:42:47] (03CR) 10Scott French: [C:03+2] Add JobQueueLowTrafficConsumerWidespreadHighLatency [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [20:43:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2038.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:43:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10328297 (10Jhancock.wm) [20:43:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2036'] [20:43:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2038'] [20:43:58] (03Merged) 10jenkins-bot: Add JobQueueLowTrafficConsumerWidespreadHighLatency [alerts] - 10https://gerrit.wikimedia.org/r/1091797 (https://phabricator.wikimedia.org/T378609) (owner: 10Scott French) [20:44:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2038'] [20:44:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['restbase2036'] [20:45:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2036.codfw.wmnet with OS bullseye [20:45:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2038.codfw.wmnet with OS bullseye [20:45:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10328333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host restbase2036.codfw.wmnet with OS bullseye [20:45:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10328335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host restbase2038.codfw.wmnet with OS bullseye [20:47:26] (03CR) 10Majavah: [C:03+2] hieradata: Update codfw1dev cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/1091798 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [20:48:03] (03PS1) 10Majavah: hieradata: Update codfw1dev horizon to 2024-11-15-203900 [puppet] - 10https://gerrit.wikimedia.org/r/1091816 [20:48:57] (03CR) 10Majavah: [C:03+2] hieradata: Update codfw1dev horizon to 2024-11-15-203900 [puppet] - 10https://gerrit.wikimedia.org/r/1091816 (owner: 10Majavah) [20:50:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:54:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding elastic2110 to codfw - jhancock@cumin2002" [20:54:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding elastic2110 to codfw - jhancock@cumin2002" [20:54:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:56:27] (03CR) 10Ahmon Dancy: [C:04-1] role::beta::deploymentserver: Add 
jenkins-deploy to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [20:56:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2111.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2112.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2113.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2114.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:56:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2115.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:57:46] (03CR) 10Jdlrobson: [C:03+1] Revert "Allow other input and changes to trigger searchsuggestions to update" [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091605 (owner: 10Samtar) [21:04:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2036.codfw.wmnet with reason: host reimage [21:07:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2112.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:44] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2115.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2038.codfw.wmnet with reason: host reimage [21:08:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2111.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:08:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2114.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:42] (03PS4) 10Ahmon Dancy: role::beta::deploymentserver: Add mwbuilder to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [21:10:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2036.codfw.wmnet with reason: host reimage [21:11:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2113.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:11:19] (03CR) 10CI reject: [V:04-1] role::beta::deploymentserver: Add mwbuilder to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [21:11:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2110'] [21:12:07] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2111'] [21:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2111'] [21:12:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2112'] [21:12:41] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2112'] [21:12:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2114'] [21:12:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2114'] [21:13:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2115'] [21:13:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2115'] [21:13:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2038.codfw.wmnet with reason: host reimage [21:14:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2111.codfw.wmnet with OS bullseye [21:14:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2112.codfw.wmnet with OS bullseye [21:14:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2114.codfw.wmnet with OS bullseye [21:14:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2115.codfw.wmnet with OS bullseye [21:14:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328382 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2111.co... [21:14:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2112.co... 
[21:14:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2114.co... [21:14:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2115.co... [21:17:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:14] (03PS5) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [21:20:45] 06SRE, 10LDAP-Access-Requests: Access to Data Hub - IAckerman-WMF - https://phabricator.wikimedia.org/T380091#10328405 (10Peachey88) [21:20:50] (03CR) 10CI reject: [V:04-1] role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon Dancy) [21:22:29] (03PS6) 10Ahmon Dancy: role::beta::deploymentserver: Populate docker group [puppet] - 10https://gerrit.wikimedia.org/r/1091787 [21:22:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:23:53] (03PS1) 10Reedy: noc: Expose MobileUrlCallback.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091818 [21:27:46] (03CR) 10Ahmon Dancy: "Ready" [puppet] - 10https://gerrit.wikimedia.org/r/1091787 (owner: 10Ahmon 
Dancy) [21:28:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:29:55] 06SRE, 10LDAP-Access-Requests: Access to Data Hub - IAckerman-WMF - https://phabricator.wikimedia.org/T380091#10328422 (10greg) Ilse's manager (Erica Roden, Dir of Fundraising Operations) is out on parental leave until May. I can give my approval for this request as a peer of Erica's. [21:30:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2111.codfw.wmnet with reason: host reimage [21:30:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2114.codfw.wmnet with reason: host reimage [21:30:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2115.codfw.wmnet with reason: host reimage [21:33:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2111.codfw.wmnet with reason: host reimage [21:33:55] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:35:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:35:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2036.codfw.wmnet with OS bullseye [21:35:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10328429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host restbase2036.codfw.wmnet with OS bullseye completed: - restbase203... 
[21:35:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2114.codfw.wmnet with reason: host reimage [21:35:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:35:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2038.codfw.wmnet with OS bullseye [21:36:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10328430 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host restbase2038.codfw.wmnet with OS bullseye completed: - restbase203... [21:38:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2115.codfw.wmnet with reason: host reimage [21:47:41] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.023e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:50:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:50:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:50:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2111.codfw.wmnet with OS bullseye [21:51:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): 
Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2111.codfw.... [21:53:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:53:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:53:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2114.codfw.wmnet with OS bullseye [21:53:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2114.codfw.... [21:56:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:56:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:56:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2115.codfw.wmnet with OS bullseye [21:56:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328496 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2115.codfw.... 
[21:59:09] !log Started MediaModeration scan on commons wiki attempting to scan all failed to be scanned images - https://wikitech.wikimedia.org/wiki/MediaModeration [21:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:27] !log Started MediaModeration scan on all wikis other than commonswiki attempting to scan all failed to be scanned images - https://wikitech.wikimedia.org/wiki/MediaModeration [21:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:13:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:13:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:17:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:17:29] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles 
latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:34:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2112.codfw.wmnet with OS bullseye [22:34:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10328544 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2112.codfw.... [22:38:52] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10328575 (10Jclark-ctr) [22:39:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10328577 (10Jclark-ctr) a:03Jclark-ctr [22:41:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:41:24] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10328582 (10Jclark-ctr) a:03Jclark-ctr [22:44:13] (03PS1) 10Andrew Bogott: neutron policy.yaml: correct get_network_ip_availability rule name [puppet] - 10https://gerrit.wikimedia.org/r/1091826 (https://phabricator.wikimedia.org/T380069) [22:44:45] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - 
https://phabricator.wikimedia.org/T377878#10328585 (10Jclark-ctr) [22:45:06] (03CR) 10Andrew Bogott: [C:03+2] neutron policy.yaml: correct get_network_ip_availability rule name [puppet] - 10https://gerrit.wikimedia.org/r/1091826 (https://phabricator.wikimedia.org/T380069) (owner: 10Andrew Bogott) [22:45:33] (03PS1) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [22:47:35] (03CR) 10CI reject: [V:04-1] dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [23:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 837.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:03:45] (03PS2) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [23:04:40] (03CR) 10CI reject: [V:04-1] dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [23:05:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 837.2ms - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:15:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:19:04] !log removing 3 files for legal compliance [23:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 45299256 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:32:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:42:41] !log removing 1 file for legal compliance [23:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log