[00:04:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090575
[00:38:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090575 (owner: TrainBranchBot)
[00:54:58] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@58d7b82]: (no justification provided)
[01:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090577
[01:08:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090577 (owner: TrainBranchBot)
[01:12:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090575 (owner: TrainBranchBot)
[01:38:59] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090577 (owner: TrainBranchBot)
[01:49:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:42] (PS4) Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245)
[02:39:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:55:54] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@58d7b82]: failedpythonlol
[02:55:56] !log tchin@deploy2002 deploy aborted: failedpythonlol (duration: 00m 05s)
[02:56:11] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@58d7b82]: (no justification provided)
[02:56:22] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@58d7b82]: (no justification provided) (duration: 00m 10s)
[03:04:42] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:44] (CR) Tim Starling: [C:+1] Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: Pppery)
[04:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:57:16] (PS1) Santhosh: recommendation-api-ng: fix wikidata host header [deployment-charts] - https://gerrit.wikimedia.org/r/1090593 (https://phabricator.wikimedia.org/T379592)
[06:41:10] (CR) Giuseppe Lavagetto: [C:+1] "LGTM; I'd add a TODO comment about DRY, but you can merge as-is." [deployment-charts] - https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: Scott French)
[06:46:11] (CR) Giuseppe Lavagetto: "IMHO you're being too cautious. Restarting changeprop multiple times will have a bigger impact than a rolling restart would. I'd go as far" [deployment-charts] - https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: Scott French)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T0700)
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:04:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:45:21] SRE, collaboration-services, Wikimedia-Mailing-lists: Further improve DMARC compatibility on lists.wikimedia.org - https://phabricator.wikimedia.org/T379517#10315453 (Ladsgroup) Adding the team owning it and @jhathaway who is migrating the MTA from exim4 to postfix.
[07:45:49] (PS2) Slyngshede: C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465
[07:47:49] !log running extensions/Echo/maintenance/removeOrphanedEvents.php --force on all wikis (T308084)
[07:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] T308084: Reduce DB space used by Echo notifications - https://phabricator.wikimedia.org/T308084
[07:55:56] (CR) Ladsgroup: [C:+1] profile::mariadb::ferm_lists: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1089607 (owner: Muehlenhoff)
[07:57:55] (PS2) KartikMistry: recommendation-api-ng: fix wikidata host header [deployment-charts] - https://gerrit.wikimedia.org/r/1090593 (https://phabricator.wikimedia.org/T379592) (owner: Santhosh)
[07:59:43] (CR) Muehlenhoff: "LGTM, one comment inline" [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T0800)
[08:00:05] Hamishcz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:43] * Hamishcz waves
[08:00:54] (CR) KartikMistry: [C:+2] recommendation-api-ng: fix wikidata host header [deployment-charts] - https://gerrit.wikimedia.org/r/1090593 (https://phabricator.wikimedia.org/T379592) (owner: Santhosh)
[08:01:58] (Merged) jenkins-bot: recommendation-api-ng: fix wikidata host header [deployment-charts] - https://gerrit.wikimedia.org/r/1090593 (https://phabricator.wikimedia.org/T379592) (owner: Santhosh)
[08:02:41] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090493 (https://phabricator.wikimedia.org/T379613) (owner: Hamish)
[08:03:26] (Merged) jenkins-bot: Revert "cswiki: Add celebration logo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090493 (https://phabricator.wikimedia.org/T379613) (owner: Hamish)
[08:04:18] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1090493|Revert "cswiki: Add celebration logo"]]
[08:05:36] (CR) Muehlenhoff: [C:+2] profile::mariadb::ferm_lists: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1089607 (owner: Muehlenhoff)
[08:06:14] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:07:03] !log ladsgroup@deploy2002 ladsgroup, hamishz: Backport for [[gerrit:1090493|Revert "cswiki: Add celebration logo"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:07:51] confirmed good
[08:08:48] cool, moving forward
[08:08:51] !log ladsgroup@deploy2002 ladsgroup, hamishz: Continuing with sync
[08:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:55] (CR) Muehlenhoff: [C:+2] tcpircbot: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1090437 (owner: Muehlenhoff)
[08:12:55] (PS1) Brouberol: growthbook: make sure the /data/db folder is writeable by runuser [deployment-charts] - https://gerrit.wikimedia.org/r/1090792 (https://phabricator.wikimedia.org/T379711)
[08:13:36] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090493|Revert "cswiki: Add celebration logo"]] (duration: 09m 18s)
[08:14:02] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye
[08:24:25] (PS1) Stevemunene: ATS: add mapping for airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438)
[08:25:02] (CR) CI reject: [V:-1] ATS: add mapping for airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438) (owner: Stevemunene)
[08:25:47] (PS2) Stevemunene: ATS: add mapping for airflow-wmde [puppet] - https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438)
[08:27:10] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[08:27:20] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: Fabfur)
[08:28:55] (CR) Stevemunene: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438) (owner: Stevemunene)
[08:29:03] (CR) Fabfur: [C:+2] haproxykafka: systemd service hardening [puppet] - https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) (owner: Fabfur)
[08:30:07] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10315501 (elukey) @jhathaway another episode of the saga, ms-be2088 :D I tried to reimage it to see if the last patch of reimage to force Hdd after deb...
[08:31:51] (PS1) Ladsgroup: beta: Set the ratio of the new ParserCache keys to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1090795 (https://phabricator.wikimedia.org/T373037)
[08:32:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[08:33:57] (PS1) Brouberol: airflow-research: provision values and helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442)
[08:34:08] (CR) Ladsgroup: [C:+2] beta: Set the ratio of the new ParserCache keys to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1090795 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[08:34:49] (Merged) jenkins-bot: beta: Set the ratio of the new ParserCache keys to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1090795 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[08:40:31] (CR) Brouberol: [C:+1] "Let's only merge this when the webserver is deployed." [puppet] - https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438) (owner: Stevemunene)
[08:46:13] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:49:10] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bullseye
[08:49:24] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:50:04] (CR) JMeybohm: [C:+2] etcd::v3: Ensure etcd peers srange is sorted [puppet] - https://gerrit.wikimedia.org/r/1090467 (owner: JMeybohm)
[08:50:53] (PS1) Muehlenhoff: Revise Envoy firewall options (WIP) [puppet] - https://gerrit.wikimedia.org/r/1090798
[08:54:06] (PS3) Slyngshede: C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465
[08:54:18] (CR) Slyngshede: C:ldap::management: default members to empty list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[08:54:40] !log Updated recommedation-api to 2024-11-08-142328-production and fix wikidata host header (T379592)
[08:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:43] T379592: Unable to deploy new version of recommendation-api to production due to connectivity issues - https://phabricator.wikimedia.org/T379592
[08:54:50] (PS2) Brouberol: airflow-research: provision values and helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442)
[08:54:51] (PS1) Brouberol: Provision airflow-research namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1090801 (https://phabricator.wikimedia.org/T378442)
[08:54:52] (PS1) Brouberol: add airflow-research namespace to the list of ceph/cloudnative tenant namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1090802 (https://phabricator.wikimedia.org/T378442)
[08:56:14] (CR) Slyngshede: [C:+2] Filter out none posixGroup "group" in next_gid_number. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1090459 (owner: Slyngshede)
[08:56:31] (CR) CI reject: [V:-1] C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[08:58:02] (PS1) Brouberol: airflow-research: define OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1090803 (https://phabricator.wikimedia.org/T378442)
[08:58:03] (PS1) Brouberol: airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442)
[08:58:07] (Merged) jenkins-bot: Filter out none posixGroup "group" in next_gid_number. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1090459 (owner: Slyngshede)
[08:59:03] (PS4) Slyngshede: C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465
[09:00:05] brennen and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T0900).
[09:01:04] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye
[09:01:23] (PS1) JMeybohm: k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp [cookbooks] - https://gerrit.wikimedia.org/r/1090806 (https://phabricator.wikimedia.org/T362408)
[09:01:24] (CR) CI reject: [V:-1] C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[09:02:51] (PS2) Slyngshede: Account blocking: blocking should not fail if account is not blocked [software/bitu] - https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693)
[09:03:06] (CR) Slyngshede: Account blocking: blocking should not fail if account is not blocked (3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[09:04:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:08:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:08:59] fabfur: ^^please silence this one
[09:11:42] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bullseye
[09:12:26] (CR) Slyngshede: [C:+2] Start migrating Netbox alerts from Icinga. [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:13:43] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[09:14:05] (Merged) jenkins-bot: Start migrating Netbox alerts from Icinga. [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:16:34] (CR) Vgutierrez: "as mentioned on IRC you need to create a component for the new versions, otherwise as soon as you upload the new ones to apt.wm.o puppet w" [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[09:20:08] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration
[09:20:57] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bookworm
[09:22:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:22:42] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:23:08] jayme: k8s control plane reimage triggers those BGP alerts?
[09:23:09] bgp errors is me
[09:23:12] ack
[09:25:52] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye
[09:26:31] (CR) Ayounsi: Expose IPsec tunnel configuration from Netbox (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: Cathal Mooney)
[09:30:56] (PS2) Muehlenhoff: Revise Envoy firewall options (WIP) [puppet] - https://gerrit.wikimedia.org/r/1090798
[09:36:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090798 (owner: Muehlenhoff)
[09:38:31] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[09:41:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[09:47:19] (CR) Nikerabbit: Add new namespaces to hsb wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: Srishakatux)
[09:51:33] (CR) Clément Goubert: [C:+1] thumbor: fail health check if healthy servers is 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1089832 (https://phabricator.wikimedia.org/T379561) (owner: Hnowlan)
[09:53:06] (PS1) Muehlenhoff: Add three new Airflow LDAP groups to the list of groups considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729)
[09:53:45] (CR) CI reject: [V:-1] Add three new Airflow LDAP groups to the list of groups considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729) (owner: Muehlenhoff)
[09:55:31] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: host reimage
[09:56:10] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[09:56:36] (PS2) Muehlenhoff: Add three new Airflow LDAP groups to be considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729)
[10:00:17] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[10:00:44] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[10:00:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[10:00:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye
[10:01:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: host reimage
[10:03:30] ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717 (JMeybohm) NEW
[10:03:46] ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10315693 (JMeybohm)
[10:04:08] ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10315696 (JMeybohm)
[10:04:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:32] !log Manual restart of dump_cloud_ip_ranges.service on 'A:puppetserver or A:puppetmaster'
[10:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:55] (CR) Elukey: [V:+1 C:+2] docker_registry_ha: allow /v2/_catalog only for internal clients [puppet] - https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) (owner: Elukey)
[10:05:29] (CR) Brouberol: [C:+1] Add three new Airflow LDAP groups to be considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729) (owner: Muehlenhoff)
[10:07:03] (CR) Btullis: [C:+1] Add three new Airflow LDAP groups to be considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729) (owner: Muehlenhoff)
[10:07:25] (PS1) Ladsgroup: Set the ratio of the new ParserCache keys to 100 for prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1090809 (https://phabricator.wikimedia.org/T373037)
[10:07:39] (CR) Slyngshede: [C:+2] Account blocking: blocking should not fail if account is not blocked [software/bitu] - https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:09:25] RESOLVED: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:41] (CR) Stevemunene: [C:+1] airflow-research: define OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1090803 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[10:09:55] !log disallow calls to /v2/_catalog from the outside internet on Docker Registry hosts - T378618
[10:09:56] (Merged) jenkins-bot: Account blocking: blocking should not fail if account is not blocked [software/bitu] - https://gerrit.wikimedia.org/r/1090422 (https://phabricator.wikimedia.org/T378693) (owner: Slyngshede)
[10:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:14] (PS1) Muehlenhoff: Add missing stub secrets for airflow_wmde and airflow_research [labs/private] - https://gerrit.wikimedia.org/r/1090810 (https://phabricator.wikimedia.org/T379267)
[10:10:14] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:10:48] (CR) Btullis: [C:+1] airflow-research: define OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1090803 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[10:11:45] (CR) Stevemunene: [C:+1] Add missing stub secrets for airflow_wmde and airflow_research [labs/private] - https://gerrit.wikimedia.org/r/1090810 (https://phabricator.wikimedia.org/T379267) (owner: Muehlenhoff)
[10:14:35] (CR) Stevemunene: [C:+1] Add three new Airflow LDAP groups to be considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729) (owner: Muehlenhoff)
[10:15:23] jouncebot: nowandnext
[10:15:23] For the next 0 hour(s) and 44 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T0900)
[10:15:23] In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1100)
[10:15:52] (CR) Ladsgroup: [C:+2] Set the ratio of the new ParserCache keys to 100 for prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1090809 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[10:15:55] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1090803 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[10:16:33] (Merged) jenkins-bot: Set the ratio of the new ParserCache keys to 100 for prod [mediawiki-config] - https://gerrit.wikimedia.org/r/1090809 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[10:17:29] (CR) Brouberol: [C:+2] airflow-research: define OIDC configuration [puppet] - https://gerrit.wikimedia.org/r/1090803 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[10:17:40] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1090809|Set the ratio of the new ParserCache keys to 100 for prod (T373037)]]
[10:17:44] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[10:18:20] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[10:19:02] (CR) Ladsgroup: "Thanks!" [software] - https://gerrit.wikimedia.org/r/1089831 (https://phabricator.wikimedia.org/T379519) (owner: Jcrespo)
[10:20:11] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1090809|Set the ratio of the new ParserCache keys to 100 for prod (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:20:14] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[10:21:32] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:22:58] (CR) Muehlenhoff: [V:+2 C:+2] Add missing stub secrets for airflow_wmde and airflow_research [labs/private] - https://gerrit.wikimedia.org/r/1090810 (https://phabricator.wikimedia.org/T379267) (owner: Muehlenhoff)
[10:23:48] (PS3) Muehlenhoff: Revise Envoy firewall options [puppet] - https://gerrit.wikimedia.org/r/1090798
[10:24:07] (CR) Muehlenhoff: "PCC failure for idp* is unrelated and addressed by https://gerrit.wikimedia.org/r/c/labs/private/+/1090810" [puppet] - https://gerrit.wikimedia.org/r/1090798 (owner: Muehlenhoff)
[10:24:11] (CR) Muehlenhoff: [C:+2] Add three new Airflow LDAP groups to be considered for offboarding [puppet] - https://gerrit.wikimedia.org/r/1090807 (https://phabricator.wikimedia.org/T375729) (owner: Muehlenhoff)
[10:24:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1002.eqiad.wmnet with OS bookworm
[10:24:32] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration
[10:25:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090526 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[10:26:33] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration
[10:26:36] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10315722 (JMeybohm) Open→Resolved Resolving this. We're going to fix the others in T379717
[10:26:37] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:26:50] RECOVERY - Juniper alarms on fasw-c-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[10:27:31] SRE-Sprint-Week-Sustainability-March2023, conftool, Traffic, Sustainability (Incident Followup): Research allowing read-only access to the superset api from requestctl's web UI - https://phabricator.wikimedia.org/T379718 (Joe) NEW
[10:27:48] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bookworm
[10:28:23] (PS1) Effie Mouzeli: ipoid: bumping limits for cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/1090812
[10:29:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:29:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:29:54] PROBLEM - Juniper alarms on fasw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.30 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[10:30:06] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719 (JMeybohm) NEW
[10:32:28] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[10:32:36] my deployment has failed with this [10:32:41] https://www.irccloud.com/pastebin/rtlJf1b8/ [10:32:50] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:33:06] Is it okay to retry? jayme might be related to your work? [10:33:33] Amir1: hmpf, sorry - yes [10:33:38] please try again [10:33:38] more details [10:33:41] https://www.irccloud.com/pastebin/08NOkQkM/ [10:34:27] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1090809|Set the ratio of the new ParserCache keys to 100 for prod (T373037)]] [10:34:30] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [10:35:12] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [10:36:53] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1090809|Set the ratio of the new ParserCache keys to 100 for prod (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:37:16] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:37:52] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [10:38:50] PROBLEM - Juniper virtual chassis ports on fasw-c-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [10:40:48] RECOVERY - Juniper virtual chassis ports on fasw-c-eqiad is OK: OK: UP: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [10:41:59] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090809|Set the ratio of the new ParserCache keys to 100 for prod (T373037)]] (duration: 07m 32s) [10:42:04] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [10:42:39] (03PS1) 10Btullis: Remove base::cloud_production profile from an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1090813 (https://phabricator.wikimedia.org/T379571) [10:43:36] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp6001 
[puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) [10:43:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4510/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090813 (https://phabricator.wikimedia.org/T379571) (owner: 10Btullis) [10:44:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1090813 (https://phabricator.wikimedia.org/T379571) (owner: 10Btullis) [10:46:05] (03CR) 10FNegri: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1090813 (https://phabricator.wikimedia.org/T379571) (owner: 10Btullis) [10:46:18] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:48:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:51:37] (03PS1) 10Elukey: Revert "docker_registry_ha: allow /v2/_catalog only for internal clients" [puppet] - 10https://gerrit.wikimedia.org/r/1090817 [10:51:57] (03CR) 10Hnowlan: [C:03+2] changeprop-jobqueue: increase webVideoTranscode concurrency to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090567 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [10:52:14] (03CR) 10CI reject: [V:04-1] Revert "docker_registry_ha: allow /v2/_catalog only for internal clients" [puppet] - 10https://gerrit.wikimedia.org/r/1090817 (owner: 10Elukey) [10:52:50] (03PS1) 10Muehlenhoff: Blacklist jfs [puppet] - 10https://gerrit.wikimedia.org/r/1090818 [10:52:58] (03Merged) 10jenkins-bot: changeprop-jobqueue: increase webVideoTranscode concurrency to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090567 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott 
French) [10:53:48] (03CR) 10Elukey: [V:03+1 C:03+2] "To keep archives happy - this change is a no-op, since all the IPs that nginx sees are always 10.x. The requests from the outside internet" [puppet] - 10https://gerrit.wikimedia.org/r/1090450 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [10:54:35] (03PS2) 10Elukey: Revert "docker_registry_ha: allow /v2/_catalog only for internal clients" [puppet] - 10https://gerrit.wikimedia.org/r/1090817 [10:54:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1100) [11:03:04] (03CR) 10Elukey: [C:03+2] Revert "docker_registry_ha: allow /v2/_catalog only for internal clients" [puppet] - 10https://gerrit.wikimedia.org/r/1090817 (owner: 10Elukey) [11:04:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [11:04:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:06:42] FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:33] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[11:10:05] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. [11:12:07] (03CR) 10Btullis: [V:03+1 C:03+2] Remove base::cloud_production profile from an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1090813 (https://phabricator.wikimedia.org/T379571) (owner: 10Btullis) [11:13:44] (03PS1) 10Muehlenhoff: Add ganeti1051/ganeti1052 [puppet] - 10https://gerrit.wikimedia.org/r/1090821 (https://phabricator.wikimedia.org/T378921) [11:14:07] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage [11:14:52] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [11:15:18] (03CR) 10Btullis: [C:03+1] airflow-research: define ATS redirection [puppet] - 10https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [11:15:40] (03CR) 10Btullis: [C:03+1] growthbook: make sure the /data/db folder is writeable by runuser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090792 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [11:16:05] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088543 (https://phabricator.wikimedia.org/T379265) (owner: 10Brouberol) [11:17:19] (03CR) 10Elukey: [C:03+1] Blacklist jfs [puppet] - 10https://gerrit.wikimedia.org/r/1090818 (owner: 10Muehlenhoff) [11:17:53] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage [11:18:34] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[11:19:28] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. [11:22:19] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10315986 (10Clement_Goubert) [11:24:03] (03PS5) 10Arnaudb: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) [11:24:03] (03CR) 10Arnaudb: "this is a basic proposition on testing that a host has properly been depooled" [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [11:24:17] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti1051/ganeti1052 [puppet] - 10https://gerrit.wikimedia.org/r/1090821 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [11:25:10] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1256.eqiad.wmnet [11:25:20] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10315991 (10ops-monitoring-bot) depool host wikikube-worker1256.eqiad.wmnet by cgoubert@cumin1002 with reason: Degraded RAID [11:25:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1256.eqiad.wmnet [11:25:51] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10315993 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker1256.eqiad.wmnet completed: - wikikube-worker1256.eqia... 
[11:26:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wikikube-worker1256.eqiad.wmnet with reason: Degraded RAID [11:26:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker1256.eqiad.wmnet with reason: Degraded RAID [11:26:49] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10316001 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b845f658-b5b1-44ba-b75b-ce7430a01e60) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their service... [11:27:43] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10316003 (10Clement_Goubert) Host depooled and downtimed, you can replace the disk when able. [11:34:03] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:34:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:26] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:39:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:16] (03CR) 10Nikerabbit: [C:04-1] Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [11:41:16] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1003.eqiad.wmnet with OS 
bookworm [11:41:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-eqiad: containerd migration [11:45:35] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:45:36] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment-charts: Remove irc1002/irc2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:45:57] (03CR) 10Alexandros Kosiaris: [C:03+2] "Will make it out on the next deployment. The dependent commit has already been merged and deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:46:05] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1051 [11:46:44] (03PS1) 10Btullis: Add missing params to enable druid test to delete unused segments [puppet] - 10https://gerrit.wikimedia.org/r/1090828 (https://phabricator.wikimedia.org/T376118) [11:46:55] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:47:18] (03Merged) 10jenkins-bot: deployment-charts: Remove irc1002/irc2002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:47:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1051 [11:47:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4511/console" [puppet] - 10https://gerrit.wikimedia.org/r/1090828 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [11:48:26] (03PS1) 10Clément Goubert: wikikube: Add wikikube-worker21[28-35] [puppet] - 10https://gerrit.wikimedia.org/r/1090829 (https://phabricator.wikimedia.org/T377008) [11:48:43] (03CR) 10Alexandros Kosiaris: [C:03+1] ipoid: bumping limits for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090812 (owner: 10Effie Mouzeli) [11:48:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:48:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [11:49:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [11:49:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet [11:49:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es1022 (T376905)', diff saved to https://phabricator.wikimedia.org/P71026 and previous config saved to /var/cache/conftool/dbconfig/20241113-114913-ladsgroup.json [11:49:20] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: bumping limits for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090812 (owner: 10Effie Mouzeli) [11:49:28] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:50:27] 
(03Merged) 10jenkins-bot: ipoid: bumping limits for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090812 (owner: 10Effie Mouzeli) [11:50:49] (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-worker21[28-35] [puppet] - 10https://gerrit.wikimedia.org/r/1090829 (https://phabricator.wikimedia.org/T377008) (owner: 10Clément Goubert) [11:50:56] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:51:39] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:51:53] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:52:55] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:53:05] (03CR) 10Muehlenhoff: "thx" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089751 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [11:53:18] jouncebot: nowandnext [11:53:18] For the next 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1100) [11:53:18] In 0 hour(s) and 6 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1200) [11:54:06] (03PS1) 10Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) [11:54:32] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1052 [11:54:49] (03CR) 10CI reject: [V:04-1] GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [11:55:00] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089700 (owner: 
10PipelineBot) [11:55:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1052 [11:55:57] (03PS2) 10Btullis: Add missing params to enable druid test to delete unused segments [puppet] - 10https://gerrit.wikimedia.org/r/1090828 (https://phabricator.wikimedia.org/T376118) [11:56:02] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089700 (owner: 10PipelineBot) [11:56:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet [11:56:39] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4512/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090828 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [11:57:02] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:57:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet [11:57:19] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:57:36] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:57:38] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:58:47] (03PS1) 10Btullis: Revert "Remove labswiki from HDFS imported dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) [11:59:25] (03PS2) 10Btullis: Revert "Remove labswiki from HDFS imported dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) [11:59:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T376905)', diff saved to https://phabricator.wikimedia.org/P71027 and previous config saved to /var/cache/conftool/dbconfig/20241113-115943-ladsgroup.json [12:00:05] mvolz: Time to do the Services 
– Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1200). [12:00:35] (03PS3) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089701 (owner: 10PipelineBot) [12:01:16] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker21[28-35] [puppet] - 10https://gerrit.wikimedia.org/r/1090829 (https://phabricator.wikimedia.org/T377008) (owner: 10Clément Goubert) [12:01:16] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:02:01] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:03:04] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:03:25] (03CR) 10Stevemunene: [C:03+2] "Webserver up and running" [puppet] - 10https://gerrit.wikimedia.org/r/1090794 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:03:39] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:05:10] (03PS4) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089701 (owner: 10PipelineBot) [12:06:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet [12:06:20] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:06:57] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:09:57] (03CR) 10Mvolz: [C:03+2] zotero: Switch image from gerrit- to GitLab-hosted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088330 (https://phabricator.wikimedia.org/T374558) (owner: 10Jforrester) [12:11:01] (03Merged) 10jenkins-bot: zotero: Switch image from gerrit- to GitLab-hosted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088330 (https://phabricator.wikimedia.org/T374558) (owner: 10Jforrester) [12:11:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki 
memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:11:47] (03PS1) 10Btullis: [dumps] - Categorise labswiki (wikitech) as a big wiki [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) [12:11:50] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [12:11:53] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:12:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4513/console" [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) (owner: 10Btullis) [12:12:38] (03PS2) 10Btullis: [dumps] - Categorise labswiki (wikitech) as a big wiki [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) [12:13:05] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [12:13:19] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4514/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) (owner: 10Btullis) [12:13:40] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:14:14] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [12:14:41] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [12:14:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P71028 and previous config saved to 
/var/cache/conftool/dbconfig/20241113-121450-ladsgroup.json [12:15:05] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [12:15:35] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [12:16:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:16:38] (03CR) 10Btullis: [V:03+1 C:03+2] Add missing params to enable druid test to delete unused segments [puppet] - 10https://gerrit.wikimedia.org/r/1090828 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [12:18:00] (03PS1) 10Stevemunene: fix values symlink and ops group [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090838 (https://phabricator.wikimedia.org/T378438) [12:18:43] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. [12:20:37] (03CR) 10Btullis: [C:03+1] fix values symlink and ops group [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090838 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:24:36] (03CR) 10Stevemunene: [C:03+2] fix values symlink and ops group [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090838 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:25:39] (03Merged) 10jenkins-bot: fix values symlink and ops group [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090838 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [12:28:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. 
[12:28:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2128.codfw.wmnet with OS bookworm [12:29:21] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [12:29:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2129.codfw.wmnet with OS bookworm [12:29:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P71029 and previous config saved to /var/cache/conftool/dbconfig/20241113-122957-ladsgroup.json [12:30:30] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [12:31:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2130.codfw.wmnet with OS bookworm [12:31:12] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [12:31:42] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:32:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1051.eqiad.wmnet to cluster eqiad and group D [12:32:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2131.codfw.wmnet with OS bookworm [12:33:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1051.eqiad.wmnet to cluster eqiad and group D [12:37:45] (03CR) 10Jelto: [C:03+1] "lgtm!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1090806 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:43:31] (03PS1) 10Stevemunene: airflow-wmde: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090840 (https://phabricator.wikimedia.org/T378438) [12:44:56] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2128.codfw.wmnet with OS bookworm [12:45:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T376905)', diff saved to https://phabricator.wikimedia.org/P71030 and previous config saved to /var/cache/conftool/dbconfig/20241113-124504-ladsgroup.json [12:45:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2128.codfw.wmnet with OS bookworm [12:50:32] (03CR) 10Hnowlan: [C:03+2] thumbor: fail health check if healthy servers is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089832 (https://phabricator.wikimedia.org/T379561) (owner: 10Hnowlan) [12:50:35] (03CR) 10Urbanecm: [C:04-1] "unfortunately, defining variables in ext-* files isn't supported. 
at this point, the blob needs to be in CS.php, but only active at the ri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [12:52:01] (03Merged) 10jenkins-bot: thumbor: fail health check if healthy servers is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089832 (https://phabricator.wikimedia.org/T379561) (owner: 10Hnowlan) [12:54:22] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2128.codfw.wmnet with OS bookworm [12:54:43] (03PS1) 10Btullis: Enable deletion of unused segments on the druid-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1090842 (https://phabricator.wikimedia.org/T376118) [12:55:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2128.codfw.wmnet with OS bookworm [12:55:34] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10316407 (10MoritzMuehlenhoff) Final status update: The VMs with the legacy setup ha... 
[12:55:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4515/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090842 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [12:56:04] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [12:56:10] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:59:18] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2129.codfw.wmnet with OS bookworm [13:01:27] (03CR) 10Brouberol: [C:03+1] airflow-wmde: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090840 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [13:02:15] (03CR) 10Brouberol: [C:03+2] airflow: remove fsGroup stanzas as all containers are running with the same uid/gid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088543 (https://phabricator.wikimedia.org/T379265) (owner: 10Brouberol) [13:02:32] (03CR) 10Brouberol: [C:03+2] growthbook: make sure the /data/db folder is writeable by runuser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090792 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [13:03:04] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:04:22] (03CR) 10Stevemunene: [C:03+2] airflow-wmde: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090840 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [13:05:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:05:24] (03Merged) 10jenkins-bot: airflow-wmde: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090840 (https://phabricator.wikimedia.org/T378438) (owner: 10Stevemunene) [13:05:31] (03CR) 10Stevemunene: [C:03+1] "looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090801 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [13:05:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:05:52] (03CR) 10Stevemunene: [C:03+1] add airflow-research namespace to the list of ceph/cloudnative tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090802 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [13:06:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:06:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:06:44] (03CR) 10Stevemunene: airflow-research: provision values and helmfiles (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [13:07:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:07:07] RECOVERY - Host fasw-c-eqiad is UP: PING WARNING - Packet loss = 60%, RTA = 0.23 ms [13:07:54] (03CR) 10Brouberol: [C:03+2] Provision airflow-research namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090801 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [13:08:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:08:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:09:13] (03CR) 10Brouberol: [C:03+2] add airflow-research namespace to the list of ceph/cloudnative tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090802 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [13:09:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:10:59] (CR) Brouberol: airflow-research: provision values and helmfiles (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[13:11:05] (PS3) Brouberol: airflow-research: provision values and helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442)
[13:11:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:12:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:13:31] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:45] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:14:07] (CR) Brouberol: [C:+2] airflow-research: provision values and helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1090796 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[13:14:33] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:15:44] (PS2) Brouberol: airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442)
[13:17:09] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[13:18:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[13:18:21] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:18:46] !log installing python-cryptography security updates
[13:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:08] (CR) Stevemunene: [C:+1] airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[13:20:40] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[13:21:23] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:23:21] (PS3) Brouberol: airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442)
[13:23:21] (PS1) Brouberol: airflow-research: provision kubernetes users [puppet] - https://gerrit.wikimedia.org/r/1090845 (https://phabricator.wikimedia.org/T378442)
[13:30:39] (PS2) Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681)
[13:32:06] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad
[13:33:13] (CR) Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[13:35:47] (CR) JHathaway: [C:+1] Blacklist jfs [puppet] - https://gerrit.wikimedia.org/r/1090818 (owner: Muehlenhoff)
[13:38:50] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10316600 (jhathaway) Sorry, egg on my face, that was my fault.
I commented out the auto reboot so I could do some debugging yesterday, before the reboot...
[13:43:10] (PS1) Daimona Eaytoy: tables-catalog: Add CampaignEvents and WikimediaCampaignEvents [puppet] - https://gerrit.wikimedia.org/r/1090851 (https://phabricator.wikimedia.org/T363581)
[13:43:16] (PS1) Slyngshede: Prevalidation of permissions [software/bitu] - https://gerrit.wikimedia.org/r/1090852
[13:43:36] (PS1) Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798)
[13:45:44] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:45:55] (CR) Muehlenhoff: [C:+2] Blacklist jfs [puppet] - https://gerrit.wikimedia.org/r/1090818 (owner: Muehlenhoff)
[13:50:19] (PS1) Arturo Borrero Gonzalez: openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030)
[13:50:21] (PS1) Alexandros Kosiaris: ipoid: Bump CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1090855 (https://phabricator.wikimedia.org/T375006)
[13:50:30] (PS2) Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798)
[13:51:02] (CR) CI reject: [V:-1] openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030) (owner: Arturo Borrero Gonzalez)
[13:51:19] (Abandoned) Slyngshede: P:ircstream close IRC connection nicely after probe [puppet] - https://gerrit.wikimedia.org/r/1082212 (owner: Slyngshede)
[13:53:36] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production -
https://phabricator.wikimedia.org/T375006#10316659 (akosiaris) I ran a benchmark against `/feed/v1/ip/` because in all of the 134 errors logged in the last week in logstash, they all are for that endpoint....
[13:54:50] (CR) Alexandros Kosiaris: [C:+2] ipoid: Bump CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1090855 (https://phabricator.wikimedia.org/T375006) (owner: Alexandros Kosiaris)
[13:55:54] (Merged) jenkins-bot: ipoid: Bump CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/1090855 (https://phabricator.wikimedia.org/T375006) (owner: Alexandros Kosiaris)
[13:57:03] (PS5) Slyngshede: C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465
[13:59:07] (PS2) Arturo Borrero Gonzalez: openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030)
[13:59:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:59:37] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:59:48] (CR) CI reject: [V:-1] openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030) (owner: Arturo Borrero Gonzalez)
[14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1400).
[14:00:05] Tchanders and hnowlan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:07] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[14:00:10] (PS3) Arturo Borrero Gonzalez: openstack: wmfkeystonehooks: create LDAP groups with project name [puppet] - https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030)
[14:00:17] (CR) Btullis: [C:+1] airflow-research: provision kubernetes users [puppet] - https://gerrit.wikimedia.org/r/1090845 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:00:27] (CR) Brouberol: [C:+2] airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:00:30] SRE, SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10316711 (Kgraessle) >>! In T379173#10300219, @thcipriani wrote: >> Reason for access: Access to stat machines in production for query load testing > > Does this mean machines...
[14:00:41] o/ I'll get started on deploying my config patch
[14:00:41] (CR) Brouberol: airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:00:44] (CR) Brouberol: [C:+2] airflow-research: provision kubernetes users [puppet] - https://gerrit.wikimedia.org/r/1090845 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:01:28] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[14:01:50] (CR) TrainBranchBot: [C:+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090515 (https://phabricator.wikimedia.org/T379503) (owner: Tchanders)
[14:01:57] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[14:02:17] ops-eqiad, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753 (RobH) NEW
[14:02:39] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[14:02:54] ops-eqiad, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10316741 (RobH)
[14:02:54] (Merged) jenkins-bot: Disallow AbuseFilter protected variables use on non-temp-user wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1090515 (https://phabricator.wikimedia.org/T379503) (owner: Tchanders)
[14:03:08] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[14:03:13] ops-eqiad, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10316745 (RobH) a:Marostegui @Marostegui, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving th...
[14:03:15] o/
[14:03:25] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1090515|Disallow AbuseFilter protected variables use on non-temp-user wikis (T379503)]]
[14:03:28] T379503: Disable AbuseFilter protected variables features on wikis where Temporary accounts are not about to be released - https://phabricator.wikimedia.org/T379503
[14:03:52] my change is at the jobrunner-level and so won't take impact on the test servers. It can go straight to prod
[14:04:22] (PS1) Volans: CHANGELOG: add changelogs for release v8.16.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1090857
[14:05:27] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10316749 (MoritzMuehlenhoff)
[14:05:55] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:1090515|Disallow AbuseFilter protected variables use on non-temp-user wikis (T379503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:06:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1052.eqiad.wmnet to cluster eqiad and group D
[14:07:28] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[14:07:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1052.eqiad.wmnet to cluster eqiad and group D
[14:07:46] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[14:09:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[14:10:01] !log tchanders@deploy2002 tchanders: Continuing with sync
[14:12:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[14:14:53] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090515|Disallow AbuseFilter protected variables use on non-temp-user wikis (T379503)]]
(duration: 11m 28s)
[14:14:56] T379503: Disable AbuseFilter protected variables features on wikis where Temporary accounts are not about to be released - https://phabricator.wikimedia.org/T379503
[14:15:22] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2128.codfw.wmnet with OS bookworm
[14:15:55] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v8.16.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1090857 (owner: Volans)
[14:17:11] (PS1) Brouberol: Revert "airflow: remove fsGroup stanzas as all containers are running with the same uid/gid" [deployment-charts] - https://gerrit.wikimedia.org/r/1090860
[14:17:58] (PS2) Brouberol: Revert "airflow: remove fsGroup stanzas as all containers are running with the same uid/gid" [deployment-charts] - https://gerrit.wikimedia.org/r/1090860 (https://phabricator.wikimedia.org/T379265)
[14:19:16] o/
[14:19:20] anyone else around to deploy?
[14:19:23] otherwise I can do it
[14:19:46] (CR) Brouberol: [C:+2] Revert "airflow: remove fsGroup stanzas as all containers are running with the same uid/gid" [deployment-charts] - https://gerrit.wikimedia.org/r/1090860 (https://phabricator.wikimedia.org/T379265) (owner: Brouberol)
[14:19:57] (CR) Ssingh: "I may be missing something obvious but in the past how we have handled this is by simply uploading stuff to apt.wm.org when things are rea" [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[14:19:58] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757 (RobH) NEW
[14:20:14] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10316859 (RobH)
[14:21:04] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10316863
(RobH) a:Marostegui @Marostegui, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving th...
[14:21:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[14:21:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[14:22:21] Lucas_WMDE: I think all that remains is my patch if you'd be so kind. It can go straight to prod
[14:23:13] (CR) Vgutierrez: "that approach blocks reimaging as well, and we have some scheduled pretty soon AFAIK" [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[14:23:23] (PS3) Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798)
[14:23:33] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090526 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[14:24:24] (Merged) jenkins-bot: TimedMediahandler: reenable shellbox-video for commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1090526 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[14:24:52] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]]
[14:24:58] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241
[14:25:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[14:26:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[14:26:29] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v8.16.0
[software/spicerack] - https://gerrit.wikimedia.org/r/1090857 (owner: Volans)
[14:27:19] (CR) Brouberol: [C:+2] airflow-research: define ATS redirection [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:27:28] !log lucaswerkmeister-wmde@deploy2002 hnowlan, lucaswerkmeister-wmde: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:27:36] !log lucaswerkmeister-wmde@deploy2002 hnowlan, lucaswerkmeister-wmde: Continuing with sync
[14:27:40] (CR) Brouberol: [C:+2] "brouberol@cp1100:~$ openssl s_client -connect airflow-research.discovery.wmnet:30443 |openssl x509 -text -noout | head" [puppet] - https://gerrit.wikimedia.org/r/1090804 (https://phabricator.wikimedia.org/T378442) (owner: Brouberol)
[14:27:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: KCVelaga)
[14:28:32] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[14:29:07] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[14:30:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad
[14:31:18] (PS1) Volans: Upstream release v8.16.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1090861
[14:32:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]] (duration: 07m 28s)
[14:32:23] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241
[14:33:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2128.codfw.wmnet with OS bookworm
[14:33:20] Lucas_WMDE: thank you!
[14:33:21] !lof installing openssl security updates
[14:33:42] np :)
[14:35:33] !log UTC afternoon backport+config window done
[14:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:27] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2130.codfw.wmnet with OS bookworm
[14:36:39] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2131.codfw.wmnet with OS bookworm
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:00] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10316935 (akosiaris) After the CPU limit increase, I no longer see CPU throttling. The amount of rps served hasn't changed, suggesting it was a result of how the bench...
[14:37:31] !log installing openssl security updates
[14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:33] (PS1) Fabfur: hiera: enable haproxykafka on eqsin (round two) [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578)
[14:38:01] (CR) Brouberol: [C:+1] [dumps] - Categorise labswiki (wikitech) as a big wiki [puppet] - https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) (owner: Btullis)
[14:38:44] (PS1) Elukey: docker_registry_ha: log X-Client-IP from the CDN [puppet] - https://gerrit.wikimedia.org/r/1090867
[14:40:15] (PS4) Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798)
[14:40:45] (CR) Slyngshede: [C:+2] C:ldap::management: default members to empty list [puppet] - https://gerrit.wikimedia.org/r/1090465 (owner: Slyngshede)
[14:41:27] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[14:43:01] (CR) Vgutierrez: [C:+1] docker_registry_ha: log X-Client-IP from the CDN [puppet] - https://gerrit.wikimedia.org/r/1090867 (owner: Elukey)
[14:43:21] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[14:43:54] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10316983 (MoritzMuehlenhoff)
[14:44:10] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10316984 (akosiaris) After re-running the benchmark with 6 threads instead of 2 and 100 connections instead of 10, I managed to make the service to finally
[return som...
[14:45:31] (CR) Volans: [C:+2] Upstream release v8.16.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1090861 (owner: Volans)
[14:46:59] (PS2) Fabfur: hiera: enable haproxykafka on drmrs [puppet] - https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578)
[14:50:27] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:50:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet
[14:51:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet
[14:51:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2128.codfw.wmnet with reason: host reimage
[14:54:07] (PS1) Muehlenhoff: Switch ganeti7001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1090870
[14:54:37] (PS1) Xcollazo: Fix dump_fillin_wd systemd timer schedule. [puppet] - https://gerrit.wikimedia.org/r/1090871
[14:54:45] (Merged) jenkins-bot: Upstream release v8.16.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1090861 (owner: Volans)
[14:55:14] (CR) CI reject: [V:-1] Fix dump_fillin_wd systemd timer schedule. [puppet] - https://gerrit.wikimedia.org/r/1090871 (owner: Xcollazo)
[14:55:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2128.codfw.wmnet with reason: host reimage
[14:55:43] (PS2) Xcollazo: Fix dump_fillin_wd systemd timer schedule.
[puppet] - https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393)
[14:55:58] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@2eb8320]: Stage Refine [airflow-dags@2eb8320d]
[14:56:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@2eb8320]: Stage Refine [airflow-dags@2eb8320d] (duration: 00m 14s)
[14:56:19] (CR) CI reject: [V:-1] Fix dump_fillin_wd systemd timer schedule. [puppet] - https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: Xcollazo)
[14:57:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2129.codfw.wmnet with OS bookworm
[14:57:32] (CR) Xcollazo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: Xcollazo)
[14:59:17] (PS46) Arnaudb: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146)
[14:59:19] !log uploaded spicerack_8.16.0 to apt.wikimedia.org bullseye-wikimedia
[14:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1500)
[15:00:13] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10317118 (akosiaris) Finally, let me say that for the last week, the logstash dashboard says 133 errors. The service serves per week ~540K queries. 133 errors out of ~...
[15:00:29] (CR) Xcollazo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: Xcollazo)
[15:02:26] (CR) Xcollazo: "Hmm.. not sure why tests are failing to run."
[puppet] - https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: Xcollazo)
[15:05:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:05:47] (CR) Ssingh: "Yes that's a fair criticism of this approach." [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:45] (CR) Xcollazo: "Before we enable, we should check with @joal@wikimedia.org if it is ok that the size of the one bz2 file from labswiki is, for now, 15.4GB" [puppet] - https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) (owner: Btullis)
[15:07:34] (CR) Ssingh: [C:+1] hiera: enable haproxykafka on eqsin (round two) [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:08:14] (CR) Vgutierrez: [C:+1] hiera: enable haproxykafka on eqsin (round two) [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:08:31] (CR) Ssingh: [C:+1] "cp5017 has an override fwiw." [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:09:23] (CR) Muehlenhoff: [C:+2] Switch ganeti7001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1090870 (owner: Muehlenhoff)
[15:10:21] (CR) Fabfur: "thanks for spotting that!"
[puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:10:29] (CR) Fabfur: [C:+2] hiera: enable haproxykafka on eqsin (round two) [puppet] - https://gerrit.wikimedia.org/r/1090865 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:10:44] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:12:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2130.codfw.wmnet with OS bookworm
[15:13:56] (PS1) Fabfur: hiera: removed cp5017 host override for haproxykafka [puppet] - https://gerrit.wikimedia.org/r/1090873 (https://phabricator.wikimedia.org/T378578)
[15:14:14] (CR) Ssingh: [C:+1] hiera: removed cp5017 host override for haproxykafka [puppet] - https://gerrit.wikimedia.org/r/1090873 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:14:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2128.codfw.wmnet with OS bookworm
[15:14:36] ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10317193 (RobH)
[15:15:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet
[15:15:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2129.codfw.wmnet with reason: host reimage
[15:15:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2131.codfw.wmnet with OS bookworm
[15:15:39] (PS2) Daimona Eaytoy: tables-catalog: Add CampaignEvents and WikimediaCampaignEvents [puppet] - https://gerrit.wikimedia.org/r/1090851
(https://phabricator.wikimedia.org/T363581)
[15:15:41] (CR) Ladsgroup: [C:+2] tables-catalog: Add CampaignEvents and WikimediaCampaignEvents [puppet] - https://gerrit.wikimedia.org/r/1090851 (https://phabricator.wikimedia.org/T363581) (owner: Daimona Eaytoy)
[15:15:43] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add CampaignEvents and WikimediaCampaignEvents [puppet] - https://gerrit.wikimedia.org/r/1090851 (https://phabricator.wikimedia.org/T363581) (owner: Daimona Eaytoy)
[15:17:33] (CR) Fabfur: [C:+2] hiera: removed cp5017 host override for haproxykafka [puppet] - https://gerrit.wikimedia.org/r/1090873 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[15:18:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2129.codfw.wmnet with reason: host reimage
[15:18:44] !log installing apache2 security updates
[15:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:42] ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10317221 (RobH) I've scheduled the work via remote hands ticket CS1028070 and also detailed on that ticket the multiple ways to reach me during the wind...
[15:20:56] FIRING: MaxConntrack: Max conntrack at 98.19% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[15:23:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
[15:25:55] RESOLVED: MaxConntrack: Max conntrack at 94.45% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[15:26:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:27:36] (CR) Urbanecm: [C:-1] "-1 for visibility" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[15:29:14] (CR) Giuseppe Lavagetto: "Generally LGTM but I'd add a better exception message when the input of upstream_version is not valid."
[docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1090562 (owner: CDanis)
[15:30:12] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10317286 (Jhancock.wm)
[15:30:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2130.codfw.wmnet with reason: host reimage
[15:30:21] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002"
[15:30:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for IPs moving from old to new fundraising firewalls - cmooney@cumin1002"
[15:30:46] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:30:50] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:33:12] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:33:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:33:58] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2131.codfw.wmnet with reason: host reimage
[15:34:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2130.codfw.wmnet with reason: host reimage
[15:35:59] !log failover ganeti master of magru01 to ganeti7001
[15:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:37:10] (PS1) Slyngshede: Netbox: Update runbook, add dashboard and physicalhosts report.
[alerts] - 10https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) [15:37:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2131.codfw.wmnet with reason: host reimage [15:37:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2129.codfw.wmnet with OS bookworm [15:38:32] PROBLEM - ganeti-wconfd running on ganeti7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:42:18] (03CR) 10Ayounsi: [C:03+1] Netbox: Update runbook, add dashboard and physicalhosts report. [alerts] - 10https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:42:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2132.codfw.wmnet with OS bookworm [15:45:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2133.codfw.wmnet with OS bookworm [15:45:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/hdfs-synchronizer: apply [15:47:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [15:47:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [15:47:53] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [15:48:33] (03CR) 10Xcollazo: [C:03+1] [dumps] - Categorise labswiki (wikitech) as a big wiki [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) (owner: 10Btullis) [15:52:40] (03PS1) 10Ssingh: cp7001: temporarily set check_min_fe_mem to true [puppet] - 10https://gerrit.wikimedia.org/r/1090879 [15:52:48] 06SRE, 10Charts, 06Infrastructure-Foundations, 
07Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939#10317453 (10CDanis) 05Open→03Resolved a:03CDanis [15:53:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2130.codfw.wmnet with OS bookworm [15:53:24] (03CR) 10Ssingh: [C:03+2] cp7001: temporarily set check_min_fe_mem to true [puppet] - 10https://gerrit.wikimedia.org/r/1090879 (owner: 10Ssingh) [15:54:00] (03PS1) 10Brouberol: growthbook: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090880 (https://phabricator.wikimedia.org/T379711) [15:56:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2131.codfw.wmnet with OS bookworm [15:56:51] (03PS3) 10Scott French: changeprop: add latency_sensitive_jobs_config (jobqueue) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) [15:56:51] (03PS4) 10Scott French: changeprop-jobqueue: add latency-sensitive upload jobs rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) [15:57:40] (03Abandoned) 10Scott French: changeprop-jobqueue: enable AssembleUploadChunks rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089315 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [16:01:28] (03PS1) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1090882 (https://phabricator.wikimedia.org/T358260) [16:02:16] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1090882 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [16:02:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2132.codfw.wmnet with reason: host reimage [16:02:29] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1090882 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [16:03:04] (03CR) 10Scott French: "Thanks, Joe!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [16:03:47] (03CR) 10Elukey: [C:03+2] docker_registry_ha: log X-Client-IP from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1090867 (owner: 10Elukey) [16:04:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2133.codfw.wmnet with reason: host reimage [16:04:48] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: set do_ipv6_primary_ra for all LVS in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1090882 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [16:05:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2132.codfw.wmnet with reason: host reimage [16:06:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2134.codfw.wmnet with OS bookworm [16:06:55] (03CR) 10Btullis: [V:03+1 C:03+2] [dumps] - Categorise labswiki (wikitech) as a big wiki [puppet] - 10https://gerrit.wikimedia.org/r/1090834 (https://phabricator.wikimedia.org/T374952) (owner: 10Btullis) [16:07:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2135.codfw.wmnet with OS bookworm [16:08:00] !log running agent on A:ulsfo and A:lvs [16:08:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2133.codfw.wmnet with reason: host reimage [16:10:19] (03CR) 10JMeybohm: [C:03+2] k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp [cookbooks] - 10https://gerrit.wikimedia.org/r/1090806 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [16:11:03] (03CR) 10Scott French: "That's totally fair, and in retrospect any "correct" job that claims to be idempotent to support the retry-on-failure case also needs to h" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [16:13:18] (03CR) 10FNegri: "This class is still referenced in modules/profile/manifests/openstack/base/puppetmaster/frontend.pp, though I'm not sure if that profile i" [puppet] - 10https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:14:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [16:15:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [16:17:12] (03PS1) 10Muehlenhoff: Switch ganeti7003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1090890 [16:17:20] (03Merged) 10jenkins-bot: k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp [cookbooks] - 10https://gerrit.wikimedia.org/r/1090806 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [16:17:43] (03PS2) 10Srishakatux: Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) [16:20:13] PROBLEM - Host ms-be2082 is DOWN: PING CRITICAL - Packet loss = 100% [16:21:10] (03CR) 10Srishakatux: Add new namespaces to hsb wiktionary (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [16:21:27] (03PS1) 10JMeybohm: preseed: Migrate wikikube-ctrl2* to containerd partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1090891 (https://phabricator.wikimedia.org/T377877) [16:22:21] RECOVERY - Host ms-be2082 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [16:23:16] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti7003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1090890 (owner: 10Muehlenhoff) [16:24:18] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778 (10ayounsi) 03NEW [16:24:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2132.codfw.wmnet with OS bookworm [16:25:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2134.codfw.wmnet with reason: host reimage [16:26:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2135.codfw.wmnet with reason: host reimage [16:27:35] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-10-15-192817 to 2024-11-13-145636 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090892 (https://phabricator.wikimedia.org/T356144) [16:27:55] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-10-10-202633 to 2024-11-12-161156 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090893 (https://phabricator.wikimedia.org/T356144) [16:29:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [16:29:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2134.codfw.wmnet with reason: host reimage [16:29:16] (03CR) 10Jelto: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1090891 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [16:29:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2133.codfw.wmnet with OS bookworm [16:29:56] !log shutdown old office link interface - T379778 [16:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:06] T379778: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778 [16:30:34] (03CR) 10JMeybohm: [C:03+2] preseed: Migrate wikikube-ctrl2* to containerd partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1090891 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [16:30:53] !log reload nginx on registry* to pick up logging changes (log of X-Client-IP from the CDN) [16:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:08] !log jayme@cumin2002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2002.codfw.wmnet [16:31:37] (03PS1) 10Bking: hdfs-synchronizer: remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090894 (https://phabricator.wikimedia.org/T371994) [16:31:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2135.codfw.wmnet with reason: host reimage [16:33:12] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317697 (10ayounsi) [16:33:29] (03PS3) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [16:37:03] (03PS1) 10Ayounsi: Remove office policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/1090897 (https://phabricator.wikimedia.org/T379778) [16:37:06] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [16:38:33] PROBLEM - Etcd cluster health on wikikube-ctrl2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [16:39:41] FIRING: [2x] ProbeDown: Service wikikube-ctrl2002:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:39:46] oh [16:39:53] ^ jayme ? [16:39:53] that is me, sorry [16:39:56] ok! [16:40:00] !incidents [16:40:01] 5440 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [16:40:03] acked [16:40:07] that was fast :) [16:40:13] o/ [16:40:51] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: reimage [16:41:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: reimage [16:42:24] (03CR) 10Stevemunene: [C:03+1] growthbook: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090880 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [16:42:45] (03PS1) 10Brouberol: datahub: leverage liveness and readiness probes for the gms and consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090900 (https://phabricator.wikimedia.org/T379711) [16:43:34] (03CR) 10Brouberol: [C:03+2] growthbook: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090880 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [16:47:06] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement, 13Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317783 (10RobH) [16:47:22] !log brouberol@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:47:24] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bookworm [16:47:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:47:40] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-ctrl2002 [16:48:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2134.codfw.wmnet with OS bookworm [16:48:33] (03PS1) 10Scott French: changeprop-jobqueue: bump max.poll.interval.ms to 2h [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090898 (https://phabricator.wikimedia.org/T356241) [16:49:46] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [16:50:07] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement, 13Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317801 (10RobH) I'll have the xconnect disconnected by remote hands during the cross-connect disconnection, putting in the cross c... 
[16:50:44] (03PS1) 10Brouberol: growthbook: increase the PVC size to 30GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090905 (https://phabricator.wikimedia.org/T379711) [16:50:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2135.codfw.wmnet with OS bookworm [16:52:08] (03CR) 10Brouberol: [C:03+2] growthbook: increase the PVC size to 30GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090905 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [16:53:18] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-ctrl2002 - jayme@cumin2002" [16:53:20] (03PS1) 10Ssingh: Revert "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1090906 [16:53:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-ctrl2002 - jayme@cumin2002" [16:53:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:24] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-ctrl2002.codfw.wmnet 76.32.192.10.in-addr.arpa 6.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:53:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:53:27] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-ctrl2002.codfw.wmnet 76.32.192.10.in-addr.arpa 6.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:53:28] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2002 [16:53:39] (03PS1) 10JHathaway: Revert "sre.hosts.reimage: improve UEFI for Supermicro" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1090907 [16:53:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2002 [16:53:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-ctrl2002 [16:53:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:56:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:22] !log homer 'lsw1-b2-codfw*' commit T377008 [16:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] T377008: wikikube-worker21[28-35] implementation tracking - https://phabricator.wikimedia.org/T377008 [17:00:58] (03CR) 10Muehlenhoff: "That class was only used by standalone Puppet 5 puppet masters, which are no longer in use (and will be removed in a separate step)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [17:01:08] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [17:01:15] !log homer 'lsw1-b4-codfw*' commit T377008 [17:01:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10317829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:24] !log homer 'cr*codfw*' commit T377008 [17:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:24] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2082.codfw.wmnet with OS bullseye [17:03:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10317841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye execut... [17:03:54] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: bump max.poll.interval.ms to 2h [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090898 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:04:46] (03CR) 10FNegri: [C:03+1] "> That class was only used by standalone Puppet 5 puppet masters, which are no longer in use (and will be removed in a separate step)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [17:07:29] cr bgp alerts are me, btw [17:08:03] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-Needs-Improvement: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10317851 (10colewhite) [17:09:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:21] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [17:14:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [17:15:06] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [17:15:20] (03CR) 10Tchanders: [C:03+1] Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) (owner: 10Dreamy Jazz) [17:15:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10317906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [17:17:04] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 370, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:25] !log homer 'lsw1-c4-codfw*' commit 'T377008' [17:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:28] T377008: wikikube-worker21[28-35] implementation tracking - https://phabricator.wikimedia.org/T377008 [17:17:32] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:08] !log homer 'lsw1-d4-codfw*' commit 'T377008' [17:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:26] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: bump max.poll.interval.ms to 2h [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090898 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:18:54] !log homer 'lsw1-c2-codfw*' commit 'T377008' [17:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:29] (03Merged) 10jenkins-bot: changeprop-jobqueue: bump max.poll.interval.ms to 2h [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090898 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:20:18] !log homer 'lsw1-d2-codfw*' commit 'T377008' [17:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:13] * MichaelG_WMF is going to run a GrowthExperiments-related maintenance script on cswiki to take care of some dangling search-index and db entries. 
Not expecting any trouble [17:22:39] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop: add latency_sensitive_jobs_config (jobqueue) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:22:43] (03PS1) 10Cathal Mooney: Add validators for Netbox IPsec elements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) [17:23:05] (03PS6) 10Cathal Mooney: Expose IPsec tunnel configuration from Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) [17:23:10] (03CR) 10Alexandros Kosiaris: [C:03+1] changeprop-jobqueue: add latency-sensitive upload jobs rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:23:24] (03PS1) 10BCornwall: varnish: Increase RSA cert warnings to 10% [puppet] - 10https://gerrit.wikimedia.org/r/1090912 (https://phabricator.wikimedia.org/T370837) [17:23:30] (03CR) 10Scott French: "Thank you both for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:23:31] (03PS4) 10Cathal Mooney: Add automation for IPsec tunnels on srx devices based on Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1089861 (https://phabricator.wikimedia.org/T378020) [17:23:32] (03CR) 10Scott French: [C:03+2] changeprop: add latency_sensitive_jobs_config (jobqueue) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:23:45] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2128-2135].codfw.wmnet [17:23:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2128-2135].codfw.wmnet [17:24:32] (03CR) 10Ssingh: [C:03+1] varnish: Increase RSA cert warnings to 10% [puppet] - 10https://gerrit.wikimedia.org/r/1090912 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:24:34] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4517/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090912 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:24:53] (03Merged) 10jenkins-bot: changeprop: add latency_sensitive_jobs_config (jobqueue) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089313 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:25:53] (03PS5) 10Scott French: changeprop-jobqueue: add latency-sensitive upload jobs rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) [17:26:37] (03PS2) 10Cathal Mooney: Add validators for Netbox IPsec elements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) [17:27:22] (03CR) 10Cathal Mooney: "Updated 
now to remove the authentication algo stuff, with matching validators for netbox so one is not added" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1089854 (https://phabricator.wikimedia.org/T378020) (owner: 10Cathal Mooney) [17:27:35] (03CR) 10Scott French: "Thank you both for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:27:37] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: add latency-sensitive upload jobs rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:28:44] (03CR) 10BCornwall: [V:03+2 C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1090912 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:28:48] (03Merged) 10jenkins-bot: changeprop-jobqueue: add latency-sensitive upload jobs rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089314 (https://phabricator.wikimedia.org/T379035) (owner: 10Scott French) [17:30:46] (03PS1) 10Cathal Mooney: Enable validators for IKE/IPsec definitions [puppet] - 10https://gerrit.wikimedia.org/r/1090914 (https://phabricator.wikimedia.org/T378020) [17:32:17] (03PS3) 10Cathal Mooney: Add validators for Netbox IPsec elements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1090911 (https://phabricator.wikimedia.org/T378020) [17:32:59] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [17:33:23] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [17:33:41] (03PS1) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [17:35:38] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:36:29] (03CR) 10CI reject: [V:04-1] haproxykafka: 
working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [17:36:32] (03PS2) 10CDanis: docker-pkg: add upstream_version template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 [17:36:32] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:02] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [17:38:09] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10318061 (10herron) p:05Triage→03Medium [17:38:18] (03CR) 10CDanis: docker-pkg: add upstream_version template helper (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 (owner: 10CDanis) [17:38:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2002.codfw.wmnet with OS bookworm [17:39:17] (03CR) 10Brouberol: [C:03+1] hdfs-synchronizer: remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090894 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [17:39:30] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl2002.codfw.wmnet [17:39:31] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl2002.codfw.wmnet [17:40:02] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2002.codfw.wmnet [17:41:45] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090894 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [17:41:56] (03CR) 10CDanis: [C:03+2] docker-pkg: add upstream_version template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 (owner: 10CDanis) [17:42:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10318096 (10herron) Hello! A few next-steps to move forward with this request: * @Dbrant could you please confirm your ssh key to me via a secondary method? Email or slack would be fine * @S... [17:44:58] (03CR) 10Bking: [C:03+2] hdfs-synchronizer: remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090894 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [17:45:28] 06SRE, 10Release Pipeline, 06serviceops, 07Epic, 10Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901#10318127 (10akosiaris) [17:45:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:45:58] (03Merged) 10jenkins-bot: hdfs-synchronizer: remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090894 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [17:46:21] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [17:46:22] (03Merged) 10jenkins-bot: docker-pkg: add upstream_version template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090562 (owner: 10CDanis) [17:47:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:49:05] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [17:49:07] !log cdanis@deploy2002 Started deploy 
[docker-pkg/deploy@38eb04d]: ship upstream_version helper [17:49:13] 06SRE, 10Release Pipeline, 06serviceops, 07Epic, 10Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901#10318162 (10akosiaris) 05Open→03Resolved a:03akosiaris Everything that was in scope has been migrated. Whi... [17:49:29] !log cdanis@deploy2002 Finished deploy [docker-pkg/deploy@38eb04d]: ship upstream_version helper (duration: 00m 32s) [17:53:37] (03PS1) 10CDanis: release 4.0.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090923 [17:53:55] (03PS4) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [17:53:59] (03CR) 10CDanis: [C:03+2] release 4.0.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090923 (owner: 10CDanis) [17:53:59] !log mwmaint2002: foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --search-index --verbose --random # T379057 [17:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:02] T379057: Drop dangling DB records in all Add Link wikis - https://phabricator.wikimedia.org/T379057 [17:57:57] (03Merged) 10jenkins-bot: release 4.0.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1090923 (owner: 10CDanis) [17:57:58] (03PS1) 10Kamila Součková: doc: fix introduction code bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 [17:59:17] (03CR) 10Kamila Součková: "🙀" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 (owner: 10Kamila Součková) [17:59:20] (03PS3) 10Btullis: Fix dump_fillin_wd systemd timer schedule. 
[puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [17:59:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [17:59:59] (03CR) 10CI reject: [V:04-1] Fix dump_fillin_wd systemd timer schedule. [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [18:00:04] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1800). nyaa~ [18:00:07] (03PS4) 10Btullis: Fix dump_fillin_wd systemd timer schedule. [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [18:00:15] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [18:00:17] (03PS1) 10Bvibber: GlobalJsonLinksCachePurgeJob to actually invalidate caches [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090928 (https://phabricator.wikimedia.org/T374746) [18:00:31] here o/ [18:01:10] wrapping up some earlier work, but I hope to get started on this within the next 30m [18:01:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090928 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [18:04:19] (03CR) 10RLazarus: [C:03+1] mwdebug-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085494 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:04:23] (03CR) 10RLazarus: [C:03+1] hieradata: switch mw-debug "next" to 8.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1087983 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:04:29] (03PS1) 10Bking: hdfs-synchronizer: remove more unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090929 (https://phabricator.wikimedia.org/T371994) [18:04:39] (03PS2) 10Kamila Součková: doc: fix introduction code bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 [18:08:01] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10318328 (10herron) [18:08:30] (03CR) 10Bking: [C:03+2] hdfs-synchronizer: remove more unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090929 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:08:46] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [18:08:49] (03CR) 10Bking: [C:03+2] "Self-merging, as this change does not touch any production services." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090929 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:09:29] (03Merged) 10jenkins-bot: hdfs-synchronizer: remove more unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090929 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:10:07] (03PS5) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [18:11:18] (03CR) 10Joal: [C:03+1] "The file size shouldn't be a problem for HDFS or the conversion job. 
The only issue could be at copy time, if it's so long that hiccups ha" [puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) (owner: 10Btullis) [18:11:39] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [18:11:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10318362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... [18:12:05] (03CR) 10Btullis: [C:03+2] Fix dump_fillin_wd systemd timer schedule. [puppet] - 10https://gerrit.wikimedia.org/r/1090871 (https://phabricator.wikimedia.org/T379393) (owner: 10Xcollazo) [18:13:04] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06Traffic, 10Sustainability (Incident Followup): Research allowing read-only access to the superset api from requestctl's web UI - https://phabricator.wikimedia.org/T379718#10318358 (10BTullis) I think that we can do this, but maybe we should try to a... [18:13:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [18:13:28] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [18:13:30] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [18:15:43] (03PS1) 10CDanis: Upgrade to 4.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1090931 [18:17:25] (03CR) 10Xcollazo: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) (owner: 10Btullis) [18:17:31] (03CR) 10CDanis: [V:03+2 C:03+2] Upgrade to 4.0.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1090931 (owner: 10CDanis) [18:18:27] !log cdanis@deploy2002 Started deploy [docker-pkg/deploy@9d71ac3]: deploy 4.0.2 for realsies [18:18:29] (03PS1) 10Bking: hdfs-sychronizer: once again, remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090932 (https://phabricator.wikimedia.org/T371994) [18:21:03] !log cdanis@deploy2002 Finished deploy [docker-pkg/deploy@9d71ac3]: deploy 4.0.2 for realsies (duration: 02m 41s) [18:21:14] (03CR) 10Bking: [C:03+2] hdfs-sychronizer: once again, remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090932 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:21:17] (03PS1) 10Ssingh: Release 9.2.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) [18:21:34] (03CR) 10Bking: [C:03+2] "Self-merging, as this does not touch a production service." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090932 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:21:43] !log cdanis@deploy2002 Started deploy [docker-pkg/deploy@9d71ac3]: (no justification provided) [18:22:06] starting planned infra-window work now [18:22:10] (03CR) 10Scott French: [C:03+2] mwdebug-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085494 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:22:21] !log cdanis@deploy2002 Finished deploy [docker-pkg/deploy@9d71ac3]: (no justification provided) (duration: 00m 40s) [18:22:23] (03Merged) 10jenkins-bot: hdfs-sychronizer: once again, remove unneeded mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090932 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:23:01] (03PS6) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [18:23:24] (03Merged) 10jenkins-bot: mwdebug-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1085494 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:25:16] (03PS7) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [18:26:28] !log cdanis@deploy2002 Started deploy [docker-pkg/deploy@9d71ac3]: (no justification provided) [18:26:43] !log cdanis@deploy2002 Finished deploy [docker-pkg/deploy@9d71ac3]: (no justification provided) (duration: 00m 18s) [18:26:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10318425 (10cmooney) 05Open→03Resolved @Jclark-ctr I've erased the config 
on all the old devices now, so feel free to r... [18:28:52] (03PS1) 10CDanis: actually commit 4.0.2 artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1090934 [18:28:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10318437 (10cmooney) @robh the migration work is now done, all that remains is to remove the old devices and any cables connecting to them and make su... [18:29:06] (03CR) 10CDanis: [V:03+2 C:03+2] actually commit 4.0.2 artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1090934 (owner: 10CDanis) [18:29:26] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:29:40] !log cdanis@deploy2002 Started deploy [docker-pkg/deploy@3499887]: I really hope this works this time [18:30:12] !log cdanis@deploy2002 Finished deploy [docker-pkg/deploy@3499887]: I really hope this works this time (duration: 00m 34s) [18:30:49] (03PS1) 10Bking: hdfs-synchronizer: remove unneeded mcrouter config, part 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090935 (https://phabricator.wikimedia.org/T371994) [18:30:53] cdanis: so this is the magic dust they talk about. /me takes notes. [18:31:07] sukhe: it was my first encounter with scap3 [18:31:23] compounded by the instructions being partially wrong, and also, partially not followed last time [18:31:47] cdanis: there are times I have contemplated adding this message to some commits [18:32:09] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:32:46] (03CR) 10Bking: [C:03+2] hdfs-synchronizer: remove unneeded mcrouter config, part 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090935 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:33:00] (03CR) 10Bking: [C:03+2] "Self-merging, as this does not touch a production service." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1090935 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:33:01] anyway thankfully it did indeed work that time [18:33:05] <3 [18:33:46] (03Merged) 10jenkins-bot: hdfs-synchronizer: remove unneeded mcrouter config, part 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090935 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:33:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:34:29] (03PS8) 10Bking: Added helmfile.d dse-k8s-services entries for HDFS synchronizer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088608 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [18:34:45] (03CR) 10Scott French: [C:03+2] hieradata: switch mw-debug "next" to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1087983 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:36:09] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:36:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10318507 (10RobH) a:03Jclark-ctr I'd hand this over to either John or Valerie as ops-eqiad for them to remove any devices and cables. John, as sr t... 
[18:46:07] (CR) Ssingh: "sukhe@build2001:~/trafficserver$ lintian /var/cache/pbuilder/result/bullseye-amd64/trafficserver_9.2.5-1wm1_amd64.changes | grep ^E" [debs/trafficserver] - https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: Ssingh) [18:47:33] PROBLEM - SSH on rpki1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:48:17] !log swfrench@deploy2002 Started scap sync-world: Deployment to switch mwdebug-next to publish-81 - T372604 [18:48:23] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [18:48:33] noting here that i plan to roll train forward 10 or 15m into the upcoming window. [18:49:17] brennen: ack, thanks for the heads up! I'm in the very last stage of the work I'm doing, so should be out of your way by then :) [18:50:09] (PS1) Dreamrimmer: Allow Wikidata bureaucrats to remove admin rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) [18:50:11] !log swfrench@deploy2002 Finished scap sync-world: Deployment to switch mwdebug-next to publish-81 - T372604 (duration: 01m 53s) [18:52:17] (CR) Ssingh: haproxykafka: working on TLS client authentication to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: Fabfur) [18:52:22] swfrench-wmf: no rush if it happens to go a bit over - i'm just catching a quick walk in the sunshine before getting stuck at the desk for a bit. 
:) [18:52:34] 100% [18:52:46] nice, enjoy :) [18:52:50] Maroooooned for all eternity [18:53:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:54:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:51] all done on my end [18:55:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) (owner: 10Dreamrimmer) [18:56:49] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [18:57:32] (03PS1) 10CDanis: Jaeger 1.63.0, fixes duped archived traces [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1090938 (https://phabricator.wikimedia.org/T375123) [18:58:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/hdfs-synchronizer: apply [18:59:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:05] brennen and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T1900). 
[19:00:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1005 - jclark@cumin1002" [19:00:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1005 - jclark@cumin1002" [19:00:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:01:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:01:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:02:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:02:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:02:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:03:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:07:33] RECOVERY - SSH on rpki1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:08:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/hdfs-synchronizer: apply 
[19:09:40] !log 1.44.0-wmf.3 train status (T375662): no current blockers, rolling to group1. [19:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:43] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:10:00] (03CR) 10Ssingh: [C:03+2] Revert "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1090906 (owner: 10Ssingh) [19:10:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:10:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:10:43] PROBLEM - SSH on rpki1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:10:53] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:10:56] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090942 (https://phabricator.wikimedia.org/T375662) [19:10:57] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090942 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:11:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:11:51] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090942 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:12:35] RECOVERY - SSH on rpki1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:13:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for 
host thanos-be1005.eqiad.wmnet with OS bullseye [19:15:45] PROBLEM - SSH on rpki1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:16:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10318716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1005.eqiad.wmnet with OS bullseye [19:19:05] (03PS1) 10Bking: dse-k8s-services: add CNAME for hdfs-synchronizer [dns] - 10https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) [19:19:19] (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for chart-renderer [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090946 [19:19:28] (03CR) 10Herron: [C:03+1] "lgtm, nice improvement too!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1090938 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) [19:19:43] (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Add placeholder for chart-renderer [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090946 (owner: 10Ahmon Dancy) [19:20:17] dancy: ah, oops :) [19:20:26] (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for chart-renderer [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1090946 (owner: 10Ahmon Dancy) [19:21:00] cdanis: Not a problem. I don't expect anyone else to babysit the train-dev branch. 
[19:21:05] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update [19:21:27] or any DevServices.php config [19:22:11] (CR) Ssingh: [C:+1] dse-k8s-services: add CNAME for hdfs-synchronizer [dns] - https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) (owner: Bking) [19:22:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:22:59] (CR) CDanis: [V:+2 C:+2] Jaeger 1.63.0, fixes duped archived traces [docker-images/production-images] - https://gerrit.wikimedia.org/r/1090938 (https://phabricator.wikimedia.org/T375123) (owner: CDanis) [19:23:14] SRE, Observability-Alerting, Traffic: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807 (ssingh) NEW [19:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:23:49] RECOVERY - SSH on rpki1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:23:54] SRE, Observability-Alerting, Traffic: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10318786 (ssingh) p:Triage→Medium [19:26:27] aokoth@cumin1002 aokoth: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[19:26:31] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.3 refs T375662 [19:26:36] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:27:16] !log brennen@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 [19:29:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:30:13] (03CR) 10Ottomata: [C:04-1] dse-k8s-services: add CNAME for hdfs-synchronizer (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [19:31:43] (03CR) 10Ebernhardson: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [19:35:08] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update [19:35:28] (03PS1) 10CDanis: jaeger: to 1.63.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090948 (https://phabricator.wikimedia.org/T375123) [19:36:27] (03CR) 10Ottomata: "I remove my -1. Aleks informed me that the intention is to rename all this stuff eventually." 
[dns] - 10https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [19:36:51] PROBLEM - SSH on rpki1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:36:55] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [19:37:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10318880 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [19:37:12] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [19:41:43] RECOVERY - SSH on rpki1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:44:01] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [19:44:45] (03CR) 10CDanis: [C:03+2] jaeger: to 1.63.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090948 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) [19:45:57] (03Merged) 10jenkins-bot: jaeger: to 1.63.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090948 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) [19:46:29] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:47:06] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:47:52] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:51:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-be2005 to codfw - jhancock@cumin2002" [19:51:21] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-be2005 to codfw - jhancock@cumin2002" [19:51:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:52:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:55:22] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:55:43] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:57:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:58:22] (03PS1) 10Aude: Enable autocreateaccount on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090953 (https://phabricator.wikimedia.org/T378216) [19:58:23] !log brennen@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.3 refs T375662 (duration: 31m 07s) [19:58:26] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:58:28] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:58:41] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:59:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2005 [19:59:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2005 [19:59:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:59:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [20:02:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:02:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:06] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [20:16:50] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [20:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:28:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:28:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:56] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:31:09] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:34:21] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@3fc12d6]: Stage Refine [airflow-dags@3fc12d60] [20:34:36] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@3fc12d6]: Stage Refine [airflow-dags@3fc12d60] (duration: 00m 15s) [20:34:43] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [20:35:04] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [20:36:09] (03CR) 10Bking: 
[C:03+2] dse-k8s-services: add CNAME for hdfs-synchronizer [dns] - 10https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [20:36:20] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:36:22] (03CR) 10Bking: [C:03+2] dse-k8s-services: add CNAME for hdfs-synchronizer (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1090945 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [20:36:33] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:37:02] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [20:37:50] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [20:39:43] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bookworm [20:39:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10319265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm comple... [20:41:09] (03PS1) 10Jgreen: Switch analytics.frdev.wikimedia.org to the frack codfw analytics server. [dns] - 10https://gerrit.wikimedia.org/r/1090960 (https://phabricator.wikimedia.org/T366950) [20:41:52] (03PS2) 10Jgreen: Switch analytics.frdev.wikimedia.org to the frack codfw analytics server. 
[dns] - 10https://gerrit.wikimedia.org/r/1090960 (https://phabricator.wikimedia.org/T366950) [20:43:38] (03PS3) 10Ssingh: wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - 10https://gerrit.wikimedia.org/r/1088612 [20:43:55] (03CR) 10Jgreen: [C:03+2] wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [20:44:41] (03CR) 10Jgreen: [V:03+2 C:03+2] wikimedia.org: remove obsolete records for pay-lvs100[12].wm.org [dns] - 10https://gerrit.wikimedia.org/r/1088612 (owner: 10Ssingh) [20:46:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is bac460c5d5a8d97376b3e05a68c31765b6477915, dns.git is dfabcb16c999d50e870045cc2a1c869b601a427f) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:46:42] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [20:46:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10319277 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [20:47:00] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [20:47:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [20:48:00] !log deployed changeprop to clear no-op chart version diffs from CR 1089313 [20:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:31] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:49:44] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:49:44] PROBLEM - check if 
authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is dfabcb16c999d50e870045cc2a1c869b601a427f, dns.git is f966db344f453393b3c922ff8b09cfa873c14b6b) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:49:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is dfabcb16c999d50e870045cc2a1c869b601a427f, dns.git is f966db344f453393b3c922ff8b09cfa873c14b6b) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:50:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is dfabcb16c999d50e870045cc2a1c869b601a427f, dns.git is f966db344f453393b3c922ff8b09cfa873c14b6b) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:51:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:54:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090953 (https://phabricator.wikimedia.org/T378216) (owner: 10Aude) [20:54:39] (03CR) 10Bvibber: [C:03+1] "looks right!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090953 (https://phabricator.wikimedia.org/T378216) (owner: 10Aude) [20:54:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:54:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:55:14] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:55:22] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@3fc12d6]: Stage Refine [airflow-dags@3fc12d60] [20:55:52] (03Abandoned) 10Jgreen: Switch analytics.frdev.wikimedia.org to the frack codfw analytics server. 
[dns] - 10https://gerrit.wikimedia.org/r/1090960 (https://phabricator.wikimedia.org/T366950) (owner: 10Jgreen) [20:56:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:56:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:56:36] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@3fc12d6]: Stage Refine [airflow-dags@3fc12d60] (duration: 01m 14s) [20:56:37] (03PS14) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [20:57:18] (03PS1) 10CDanis: Drop useless loopback TLS between jaeger & oauth2-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090962 (https://phabricator.wikimedia.org/T375123) [20:58:26] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4519/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T2100). [21:00:06] bvibber and aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] o/ [21:00:25] o/ [21:00:28] i can deploy [21:00:29] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@3487da3]: Stage Refine [airflow-dags@3487da3a] [21:00:47] bvibber: 1st in the queue! 
[21:00:53] \o/ [21:00:58] thx cjming :D [21:01:05] hi [21:01:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090928 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [21:01:51] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@3487da3]: Stage Refine [airflow-dags@3487da3a] (duration: 01m 22s) [21:02:20] hi aude - i will do your patch next [21:02:31] (03PS1) 10Jgreen: Switch analytics.frdev.wikimedia.org to the frack codfw analytics server. [dns] - 10https://gerrit.wikimedia.org/r/1090964 (https://phabricator.wikimedia.org/T366950) [21:02:38] thanks [21:03:59] (03PS3) 10Srishakatux: Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) [21:05:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:05:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:05:35] (03CR) 10Herron: [C:03+1] Drop useless loopback TLS between jaeger & oauth2-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090962 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) [21:07:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2005 [21:07:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2005 [21:07:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:07:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:09:16] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:09:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:09:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:09:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:09:44] (03PS1) 10Tchanders: Revert "Disallow AbuseFilter protected variables use on non-temp-user wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090965 (https://phabricator.wikimedia.org/T379503) [21:10:29] (03CR) 10Dwisehaupt: [C:03+2] "This looks correct to me." [dns] - 10https://gerrit.wikimedia.org/r/1090964 (https://phabricator.wikimedia.org/T366950) (owner: 10Jgreen) [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [21:16:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:18:45] (03PS2) 10CDanis: Drop useless loopback TLS between jaeger & oauth2-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090962 (https://phabricator.wikimedia.org/T375123) [21:18:50] (03CR) 10CDanis: [C:03+2] Drop useless loopback TLS between jaeger & oauth2-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090962 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) 
[21:18:55] cjming: I've just added another one to the window, but I can do it myself once you're finished [21:19:31] Tchanders: sounds good [21:19:45] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [21:20:04] (03Merged) 10jenkins-bot: Drop useless loopback TLS between jaeger & oauth2-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090962 (https://phabricator.wikimedia.org/T375123) (owner: 10CDanis) [21:20:24] (03Merged) 10jenkins-bot: GlobalJsonLinksCachePurgeJob to actually invalidate caches [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090928 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [21:20:30] \o/ [21:20:44] bvibber: can i go ahead and sync or should/can it be tested? [21:20:44] 🎉 [21:20:48] (03PS1) 10Jgreen: Reduce TTL on analytics.frdev.wikimedia.org to 1h. [dns] - 10https://gerrit.wikimedia.org/r/1090966 [21:20:55] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1090928|GlobalJsonLinksCachePurgeJob to actually invalidate caches (T374746)]] [21:21:02] cjming: go ahead; the action's in the job queue so can't be tested via debug servers [21:21:03] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:21:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:21:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:21:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:55] (03CR) 10Jgreen: [C:03+2] Reduce TTL on analytics.frdev.wikimedia.org to 1h. 
[dns] - 10https://gerrit.wikimedia.org/r/1090966 (owner: 10Jgreen) [21:27:23] !log cjming@deploy2002 cjming, bvibber: Backport for [[gerrit:1090928|GlobalJsonLinksCachePurgeJob to actually invalidate caches (T374746)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:26] !log cjming@deploy2002 cjming, bvibber: Continuing with sync [21:27:26] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:27:38] party time [21:27:48] lol [21:34:22] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090928|GlobalJsonLinksCachePurgeJob to actually invalidate caches (T374746)]] (duration: 13m 27s) [21:34:25] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:34:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090953 (https://phabricator.wikimedia.org/T378216) (owner: 10Aude) [21:34:50] bvibber: should be live! [21:34:55] \o/ thx [21:35:01] yw! 
[21:35:30] (03Merged) 10jenkins-bot: Enable autocreateaccount on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090953 (https://phabricator.wikimedia.org/T378216) (owner: 10Aude) [21:35:58] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1090953|Enable autocreateaccount on testcommonswiki (T378216)]] [21:36:02] T378216: Update test commons to support usage tracking testing - https://phabricator.wikimedia.org/T378216 [21:36:50] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bookworm [21:38:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10319440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm comple... [21:39:59] !log cjming@deploy2002 aude, cjming: Backport for [[gerrit:1090953|Enable autocreateaccount on testcommonswiki (T378216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:40:04] aude: your patch is up on mwdebug if you'd like to test [21:40:08] lmk when to sync [21:40:28] testing it [21:42:13] thanks [21:42:31] good to sync then i assume? 
[21:44:11] we can sync [21:44:18] !log cjming@deploy2002 aude, cjming: Continuing with sync [21:44:56] (03PS1) 10Kgraessle: Enable AutoModerator on afwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090968 (https://phabricator.wikimedia.org/T376597) [21:48:12] (03CR) 10Eevans: [C:03+2] Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [21:48:58] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090953|Enable autocreateaccount on testcommonswiki (T378216)]] (duration: 12m 59s) [21:49:08] T378216: Update test commons to support usage tracking testing - https://phabricator.wikimedia.org/T378216 [21:49:15] aude: should be on prod! [21:49:23] Tchanders: all yours if you're still around [21:49:32] cjming: thanks! [21:50:00] thanks [21:50:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090965 (https://phabricator.wikimedia.org/T379503) (owner: 10Tchanders) [21:50:40] (03Merged) 10jenkins-bot: Revert "Disallow AbuseFilter protected variables use on non-temp-user wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090965 (https://phabricator.wikimedia.org/T379503) (owner: 10Tchanders) [21:51:07] !log tchanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1090965|Revert "Disallow AbuseFilter protected variables use on non-temp-user wikis" (T379503)]] [21:51:28] T379503: Disable AbuseFilter protected variables features on wikis where Temporary accounts are not about to be released - https://phabricator.wikimedia.org/T379503 [21:55:10] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:1090965|Revert "Disallow AbuseFilter protected variables use on non-temp-user wikis" (T379503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:55:27] (03PS1) 10Bking: dse-k8s-services: add CNAME for blunderbuss (nee 
hdfs-synchronizer) [dns] - 10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) [21:55:33] !log tchanders@deploy2002 tchanders: Continuing with sync [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241113T2200) [22:00:11] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090965|Revert "Disallow AbuseFilter protected variables use on non-temp-user wikis" (T379503)]] (duration: 09m 03s) [22:00:34] T379503: Disable AbuseFilter protected variables features on wikis where Temporary accounts are not about to be released - https://phabricator.wikimedia.org/T379503 [22:01:29] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-10-15-192817 to 2024-11-13-145636 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090892 (https://phabricator.wikimedia.org/T356144) (owner: 10Jforrester) [22:02:33] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-10-15-192817 to 2024-11-13-145636 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090892 (https://phabricator.wikimedia.org/T356144) (owner: 10Jforrester) [22:03:26] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:04:01] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:09:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:10:08] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:10:56] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:10:58] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:11:48] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:14:39] !log jhancock@cumin2002 END (FAIL) 
- Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:15:10] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-10-10-202633 to 2024-11-12-161156 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090893 (https://phabricator.wikimedia.org/T356144) [22:15:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:15:15] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-10-10-202633 to 2024-11-12-161156 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090893 (https://phabricator.wikimedia.org/T356144) (owner: 10Jforrester) [22:16:36] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-10-10-202633 to 2024-11-12-161156 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090893 (https://phabricator.wikimedia.org/T356144) (owner: 10Jforrester) [22:17:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:54] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:18:40] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:19:59] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:20:55] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:20:59] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:21:49] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:22:31] 06SRE, 
10Observability-Alerting, 06Traffic: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10319559 (10colewhite) The issue [[ https://grafana-rw.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=1731511462516&to=1731528409686&forceLogin&view... [22:25:03] (03PS1) 10Cwhite: sre: enable all DCs to complain about Puppet issues [alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) [22:25:06] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [22:25:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:25:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10319567 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [22:30:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:30:53] (03PS1) 10Bking: dse-k8s: add ingress config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) [22:33:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:33:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:51:20] (03PS2) 10Bking: dse-k8s: add ingress config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) [22:55:05] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
ms-be2082.codfw.wmnet with reason: host reimage [22:57:58] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [22:58:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wdqs1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:58:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wdqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:58:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:59:05] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:04:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [23:04:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [23:04:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:06:02] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [23:10:43] 06SRE: upgrade oauth2-proxy - https://phabricator.wikimedia.org/T379831#10319740 (10Peachey88) [23:15:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:17:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:20:38] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bookworm [23:22:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:22:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:22:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10319857 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm comple... [23:22:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:22:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:37:46] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:39:22] (03PS1) 10Eevans: corto: configure for production phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) [23:40:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:40:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:40:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1027.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [23:41:10] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for es104 - jclark@cumin1002" [23:41:14] (03CR) 10CI reject: [V:04-1] corto: configure for production phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) (owner: 10Eevans) [23:41:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for es104 - jclark@cumin1002" [23:41:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:42:37] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10320113 (10Jdlrobson) [23:42:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:47] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED 
[23:43:10] (03PS2) 10Eevans: corto: configure for production phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) [23:43:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:43:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:43:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:45:49] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) (owner: 10Eevans) [23:45:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:58:38] (03PS1) 10BCornwall: varnish: Pin varnish/modules versions to prod [puppet] - 10https://gerrit.wikimedia.org/r/1090984 (https://phabricator.wikimedia.org/T378737) [23:58:52] (03Abandoned) 10BCornwall: varnish: Pin varnish/modules versions to prod [puppet] - 10https://gerrit.wikimedia.org/r/1090984 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [23:59:24] (03PS2) 10BCornwall: apt/varnish: Add/Pin varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) [23:59:32] (03CR) 10BCornwall: [V:03+1] "I went ahead and altered this patch so that it creates/sets a "component/varnish-staging" component. That way we can set the packages to b" [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)