[00:04:48] (03PS2) 10C. Scott Ananian: Turn on Parsoid Selective Update metrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) [00:05:11] (03CR) 10C. Scott Ananian: [C:04-2] "Whoops, let's not deploy this until the bug fix in I42bbd370c4eba46de40261511cf49d7c462f5bfe is merged and/or backported." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: 10C. Scott Ananian) [00:10:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:30:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:55:48] (03PS1) 10Hamish: throttle.php: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 [00:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:58:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077487 (https://phabricator.wikimedia.org/T375055) (owner: 10Hamish) [00:58:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077488 (https://phabricator.wikimedia.org/T374898) (owner: 10Hamish) [00:58:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC morning 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [01:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:59] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10198467 (10ssingh) >>! In T376291#10197890, @cmooney wrote: >>>! In T376291#10197677, @ssingh wrote: >> * It seem the network... [01:24:46] (03PS1) 10Kimberly Sarabia: DONOTMERGE: Remove legacy UI actions tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) [01:26:55] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:43] (03PS1) 10C. Scott Ananian: Deprecate ParserOutput::setLanguageLinks(null) [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077505 (https://phabricator.wikimedia.org/T376323) [01:36:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077505 (https://phabricator.wikimedia.org/T376323) (owner: 10C. 
Scott Ananian) [01:47:57] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:48:41] (03PS2) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [02:11:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:31:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:31:57] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10198601 (10Papaul) [03:34:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10198604 (10phaultfinder) [04:15:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [04:35:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [04:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:26:55] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:47:57] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10198641 (10phaultfinder) [06:00:05] Deploy window 
MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0600) [06:00:05] marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0600). [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:16:42] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10198646 (10ayounsi) Let's use the latest recommended, so 23. Thx! 
[06:16:52] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10198647 (10ayounsi) [06:26:38] (03PS3) 10Brouberol: airflow: automatically inject the configuration checksum annotation on deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076755 (https://phabricator.wikimedia.org/T375886) [06:26:42] (03CR) 10Brouberol: [C:03+2] spark-operator: update base.certificate module to v2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) (owner: 10Brouberol) [06:27:15] (03PS2) 10Hamish: bjnwiktionary: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077488 (https://phabricator.wikimedia.org/T374898) [06:27:34] (03PS2) 10Hamish: bjnwiki: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077487 (https://phabricator.wikimedia.org/T375055) [06:41:54] (03PS1) 10Brouberol: Upgrade airflow-analytics-test to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077610 (https://phabricator.wikimedia.org/T373210) [06:41:54] (03PS1) 10Brouberol: Upgrade airflow-analytics to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077611 (https://phabricator.wikimedia.org/T373210) [06:41:55] (03PS1) 10Brouberol: Upgrade airflow-analytics-product to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077612 (https://phabricator.wikimedia.org/T373210) [06:41:55] (03PS1) 10Brouberol: Upgrade airflow-platform-eng to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077613 (https://phabricator.wikimedia.org/T373210) [06:41:56] (03PS1) 10Brouberol: Upgrade airflow-research to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077614 (https://phabricator.wikimedia.org/T373210) [06:41:57] (03PS1) 10Brouberol: Upgrade airflow-search to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077615 (https://phabricator.wikimedia.org/T373210) [06:42:01] (03PS1) 10Brouberol: Upgrade airflow-wmde to 2.10.2 [puppet] - 
10https://gerrit.wikimedia.org/r/1077616 (https://phabricator.wikimedia.org/T373210) [06:42:05] (03PS1) 10Brouberol: Set airflow default version to 2.10.2-py3.10-20241002 [puppet] - 10https://gerrit.wikimedia.org/r/1077617 (https://phabricator.wikimedia.org/T373210) [06:44:56] (03PS1) 10Brouberol: Upgrade airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077618 [06:46:27] (03CR) 10Brouberol: [C:03+2] airflow: automatically inject the configuration checksum annotation on deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076755 (https://phabricator.wikimedia.org/T375886) (owner: 10Brouberol) [06:46:57] (03CR) 10Brouberol: [C:03+2] Upgrade airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077618 (owner: 10Brouberol) [06:47:20] (03CR) 10Slyngshede: [C:03+2] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1077465 (https://phabricator.wikimedia.org/T376334) (owner: 10Reedy) [06:49:42] (03Merged) 10jenkins-bot: signups_signup.html: Remove extra full stop [software/bitu] - 10https://gerrit.wikimedia.org/r/1077465 (https://phabricator.wikimedia.org/T376334) (owner: 10Reedy) [06:50:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [06:51:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [06:51:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:56:17] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:ircstream Allow enabling eventstream as a datasource. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077395 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [06:58:32] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-analytics-test to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077610 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:58:57] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-analytics to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077611 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:59:16] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-analytics-product to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077612 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:59:28] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-platform-eng to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077613 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:59:30] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:59:46] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-research to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077614 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:59:48] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-analytics-test to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077610 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [06:59:54] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-analytics to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077611 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:00:01] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-search to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077615 
(https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0700). nyaa~ [07:00:05] kart_, Hamishcz, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:27] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow-wmde to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077616 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:02:28] * kart_ is here, and will deploy.. [07:02:32] o/ [07:02:40] :) I'm here [07:03:33] (03PS3) 10KartikMistry: Section Translation: Add mos, kde and rsk Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076559 (https://phabricator.wikimedia.org/T375017) [07:03:46] Hamishcz: could i will remove expired throttle in my patch [07:04:05] anzx, yes can, I [07:04:14] I'd abandon that :) [07:04:18] ok [07:04:21] (03Abandoned) 10Hamish: throttle.php: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [07:04:30] RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:04:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076559 (https://phabricator.wikimedia.org/T375017) (owner: 10KartikMistry) [07:05:37] (03Merged) 10jenkins-bot: Section Translation: Add mos, kde and rsk Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076559 
(https://phabricator.wikimedia.org/T375017) (owner: 10KartikMistry) [07:06:15] (03PS3) 10Anzx: IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) [07:06:17] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1076559|Section Translation: Add mos, kde and rsk Wikipedias (T375017 T374815 T374644)]] [07:06:22] T375017: Post-creation work for rskwiki - https://phabricator.wikimedia.org/T375017 [07:06:23] T374815: Post-creation work for kgewiki - https://phabricator.wikimedia.org/T374815 [07:06:23] T374644: Post-creation work for moswiki - https://phabricator.wikimedia.org/T374644 [07:08:43] !log kartik@deploy2002 kartik: Backport for [[gerrit:1076559|Section Translation: Add mos, kde and rsk Wikipedias (T375017 T374815 T374644)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:12:14] !log kartik@deploy2002 kartik: Continuing with sync [07:14:58] (03PS4) 10Anzx: IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) [07:15:34] (03PS5) 10Anzx: IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) [07:15:36] (03CR) 10CI reject: [V:04-1] IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) (owner: 10Anzx) [07:16:57] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076559|Section Translation: Add mos, kde and rsk Wikipedias (T375017 T374815 T374644)]] (duration: 10m 39s) [07:17:02] T375017: Post-creation work for rskwiki - https://phabricator.wikimedia.org/T375017 [07:17:03] T374815: Post-creation work for kgewiki - https://phabricator.wikimedia.org/T374815 [07:17:03] T374644: Post-creation work for moswiki - https://phabricator.wikimedia.org/T374644 [07:18:48] My patch is 
done. Any other deployers available? urbanecm Amir1? [07:20:40] Because I need to go out for the lunch :/ [07:37:26] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-analytics-product to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077612 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:38:15] (03CR) 10Giuseppe Lavagetto: [C:04-1] "+1 to the message and the idea, I'd recommend some caution (see comment on the code) and/or adding some safeguards." [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [07:44:08] o/ [07:44:15] sorry a bit late cause I was triaging train tasks [07:44:48] Hamishcz: anzx: I will look at deploying your patches :) [07:45:08] hello again:) [07:45:28] (03CR) 10Hashar: [C:03+2] Deprecate ParserOutput::setLanguageLinks(null) [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077505 (https://phabricator.wikimedia.org/T376323) (owner: 10C. Scott Ananian) [07:46:34] (03Restored) 10Hashar: throttle.php: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [07:47:37] hashar: I have already removed expired throttle in https://gerrit.wikimedia.org/r/1077619 [07:48:01] yes I have noticed and that is excellent! 
[07:48:20] since Hamishcz also sent a patch to clear up them, I am taking that opportunity to try backporting TWO changes at the same time [07:48:31] (03PS2) 10Hashar: throttle.php: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [07:48:31] (03PS6) 10Hashar: IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) (owner: 10Anzx) [07:48:49] that also makes a smaller diff in the second change :D [07:49:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [07:49:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) (owner: 10Anzx) [07:49:27] oh [07:49:32] that is working like a charm [07:49:45] thus I guess I could have carried the two log updates as well [07:50:02] (03Merged) 10jenkins-bot: throttle.php: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077502 (owner: 10Hamish) [07:50:08] (03Merged) 10jenkins-bot: IP limit exemption for WTS 2024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077619 (https://phabricator.wikimedia.org/T375794) (owner: 10Anzx) [07:50:35] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077502|throttle.php: Remove expired throttle]], [[gerrit:1077619|IP limit exemption for WTS 2024 (T375794)]] [07:50:38] T375794: IP limit exemption for WTS 2024 - https://phabricator.wikimedia.org/T375794 [07:51:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:53:10] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-platform-eng to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077613 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:53:49] !log hashar@deploy2002 anzx, hamishz, hashar: Backport for [[gerrit:1077502|throttle.php: Remove expired throttle]], [[gerrit:1077619|IP limit exemption for WTS 2024 (T375794)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:54:42] !log hashar@deploy2002 anzx, hamishz, hashar: Continuing with sync [07:55:01] (03CR) 10Elukey: "Really nice progress! Left some ideas/comments, lemme know :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [07:58:07] mwscript resetAuthenticationThrottle.php --wiki=metawiki --signup --ip 14.139.82.6 [07:58:53] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-research to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077614 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [07:59:16] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077502|throttle.php: Remove expired throttle]], [[gerrit:1077619|IP limit exemption for WTS 2024 (T375794)]] (duration: 08m 41s) [07:59:19] T375794: IP limit exemption for WTS 2024 - https://phabricator.wikimedia.org/T375794 [07:59:23] hashar: please clear memcached key [08:00:04] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0800) [08:00:23] hashar: i think clearing on wikidata, metawiki , commons and mediawikiwiki would be enough [08:00:40] ohh [08:00:45] (03CR) 10Stevemunene: [C:03+1] Set airflow default version to 2.10.2-py3.10-20241002 [puppet] - 10https://gerrit.wikimedia.org/r/1077617 (https://phabricator.wikimedia.org/T373210) 
(owner: 10Brouberol) [08:01:42] anzx: oh well caught.. I would have forgotten for sure [08:02:35] hashar: thank you for deploying [08:03:41] !log Ran `mwscript resetAuthenticationThrottle.php --signup --ip 14.139.82.6` for `metawiki`, `mediawikiwiki` and `wikidatawiki` # T375794 [08:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] Hamishcz: I am doing the logos [08:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077487 (https://phabricator.wikimedia.org/T375055) (owner: 10Hamish) [08:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077488 (https://phabricator.wikimedia.org/T374898) (owner: 10Hamish) [08:04:38] sorry for the delays :D [08:04:50] no worries [08:05:06] (03Merged) 10jenkins-bot: bjnwiki: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077487 (https://phabricator.wikimedia.org/T375055) (owner: 10Hamish) [08:05:08] (03Merged) 10jenkins-bot: bjnwiktionary: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077488 (https://phabricator.wikimedia.org/T374898) (owner: 10Hamish) [08:05:14] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-search to 2.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/1077615 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [08:05:37] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077487|bjnwiki: Update logo (T375055)]], [[gerrit:1077488|bjnwiktionary: Add logo (T374898)]] [08:05:40] T375055: Requesting logo change for bjn.wikipedia.org - https://phabricator.wikimedia.org/T375055 [08:05:41] T374898: Requesting logo change for bjn.wiktionary.org - https://phabricator.wikimedia.org/T374898 [08:05:47] once logos are done, I can go out for my dinner :0 [08:07:28] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-wmde to 2.10.2 
[puppet] - 10https://gerrit.wikimedia.org/r/1077616 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [08:07:55] !log hashar@deploy2002 hashar, hamishz: Backport for [[gerrit:1077487|bjnwiki: Update logo (T375055)]], [[gerrit:1077488|bjnwiktionary: Add logo (T374898)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:09] (03CR) 10Arnaudb: "I'll discard my CR so we can focus on this one!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [08:09:37] !log hashar@deploy2002 hashar, hamishz: Continuing with sync [08:13:27] (03CR) 10Brouberol: [C:03+2] Set airflow default version to 2.10.2-py3.10-20241002 [puppet] - 10https://gerrit.wikimedia.org/r/1077617 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [08:13:40] (03PS2) 10Brouberol: Set airflow default version to 2.10.2-py3.10-20241002 [puppet] - 10https://gerrit.wikimedia.org/r/1077617 (https://phabricator.wikimedia.org/T373210) [08:14:14] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077487|bjnwiki: Update logo (T375055)]], [[gerrit:1077488|bjnwiktionary: Add logo (T374898)]] (duration: 08m 37s) [08:14:19] T375055: Requesting logo change for bjn.wikipedia.org - https://phabricator.wikimedia.org/T375055 [08:14:19] T374898: Requesting logo change for bjn.wiktionary.org - https://phabricator.wikimedia.org/T374898 [08:14:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:14:41] Hamishcz: all done! [08:14:45] anzx: Hamishcz: thank you [08:14:58] hashar: much appreciate! 
[08:15:49] (03Merged) 10jenkins-bot: Deprecate ParserOutput::setLanguageLinks(null) [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077505 (https://phabricator.wikimedia.org/T376323) (owner: 10C. Scott Ananian) [08:16:18] Hamishcz: and thank you for all those config changes, they are very important :-] [08:17:22] w/ pleasure :) [08:17:40] (03CR) 10Brouberol: [V:03+2 C:03+2] Set airflow default version to 2.10.2-py3.10-20241002 [puppet] - 10https://gerrit.wikimedia.org/r/1077617 (https://phabricator.wikimedia.org/T373210) (owner: 10Brouberol) [08:17:42] now I can go to have my dinner lol [08:17:54] happy dinner! [08:18:23] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077505|Deprecate ParserOutput::setLanguageLinks(null) (T376323)]] [08:18:25] T376323: PHP Warning: Invalid argument supplied for foreach() - https://phabricator.wikimedia.org/T376323 [08:18:58] that patches converts the warning to a wfDeprecated() [08:20:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:20:42] !log hashar@deploy2002 hashar, cscott: Backport for [[gerrit:1077505|Deprecate ParserOutput::setLanguageLinks(null) (T376323)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:46] !log hashar@deploy2002 hashar, cscott: Continuing with sync [08:22:40] (03PS2) 10Giuseppe Lavagetto: git::replicated_local_repo: use ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/1077437 [08:24:48] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [08:25:30] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077505|Deprecate ParserOutput::setLanguageLinks(null) (T376323)]] (duration: 07m 07s) [08:25:33] T376323: PHP Warning: 
Invalid argument supplied for foreach() - https://phabricator.wikimedia.org/T376323 [08:28:56] I am running the train now [08:29:01] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077651 (https://phabricator.wikimedia.org/T375656) [08:29:02] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077651 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [08:29:45] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077651 (https://phabricator.wikimedia.org/T375656) (owner: 10TrainBranchBot) [08:30:18] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [08:36:46] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.25 refs T375656 [08:36:49] T375656: 1.43.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T375656 [08:38:15] (03CR) 10Volans: [C:04-1] "Fully agree with Luca's review, added some additional comments." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [08:40:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:41:26] (03CR) 10Elukey: [C:03+1] git::replicated_local_repo: use ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/1077437 (owner: 10Giuseppe Lavagetto) [08:47:09] (03CR) 10Giuseppe Lavagetto: [C:03+2] git::replicated_local_repo: use ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/1077437 (owner: 10Giuseppe Lavagetto) [08:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:59:23] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [08:59:43] (03PS1) 10Btullis: Update druid test config to drop unused segments automatically [puppet] - 10https://gerrit.wikimedia.org/r/1077653 (https://phabricator.wikimedia.org/T376118) [09:00:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4191/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077653 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [09:09:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:02] (03CR) 10Hnowlan: [C:03+1] "TIL changing the tracing cluster name changes the overall _local_cluster_name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074495 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [09:26:55] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:48] (03PS10) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [09:29:12] jouncebot: nowandnext [09:29:12] For the next 0 hour(s) and 30 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T0800) [09:29:12] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1000) [09:29:53] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4192/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [09:31:42] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: use SSL to connect to kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077396 (https://phabricator.wikimedia.org/T333373) (owner: 10DCausse) [09:32:45] (03Merged) 
10jenkins-bot: rdf-streaming-updater: use SSL to connect to kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077396 (https://phabricator.wikimedia.org/T333373) (owner: 10DCausse) [09:32:53] (03PS1) 10Slyngshede: R:ircstream_sse Enable eventstream source for irc1004. [puppet] - 10https://gerrit.wikimedia.org/r/1077657 (https://phabricator.wikimedia.org/T376014) [09:35:09] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:35:15] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4193/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077657 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:35:27] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:36:03] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4194/console" [puppet] - 10https://gerrit.wikimedia.org/r/1077657 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:37:31] (03PS1) 10Urbanecm: Backport ReassignMenteesJob-related changes [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077658 (https://phabricator.wikimedia.org/T376124) [09:38:37] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [09:38:48] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [09:41:47] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [09:42:09] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [09:43:10] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - 
https://phabricator.wikimedia.org/T375645#10198925 (10elukey) I may have found some clue related to why the catalog contains so many stale things: https://github.com/distribution/distri... [09:44:15] (03CR) 10Urbanecm: [C:03+2] Backport ReassignMenteesJob-related changes [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077658 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [09:45:30] 06SRE, 06Infrastructure-Foundations, 06serviceops: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10198926 (10elukey) I tried to disable the Redis cache `blobdescriptor` setting `inmemory` for eqiad registry nodes, and I didn't hit the timeout proble... [09:46:46] (03CR) 10Elukey: [C:03+1] R:ircstream_sse Enable eventstream source for irc1004. [puppet] - 10https://gerrit.wikimedia.org/r/1077657 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:47:58] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:51:32] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:ircstream_sse Enable eventstream source for irc1004. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077657 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:54:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10198963 (10phaultfinder) [09:57:58] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:58:18] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@b715af7]: T375153 [09:58:22] T375153: ETL pipeline for Automoderator daily monitoring metrics - https://phabricator.wikimedia.org/T375153 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1000) [10:00:14] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM irc1004.wikimedia.org [10:00:54] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@b715af7]: T375153 (duration: 02m 44s) [10:04:10] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM irc1004.wikimedia.org [10:04:54] (03PS1) 10Elukey: sre.hosts.provision: add IPv4AutoConfigEnabled for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1077660 [10:06:07] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:06:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:06:58] (03CR) 10Elukey: "The property IPv4AutoConfigEnabled is a read only property and cannot be assigned a value." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077660 (owner: 10Elukey) [10:08:15] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:10:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:11:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077658 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [10:11:36] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:18:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077422 (https://phabricator.wikimedia.org/T376292) (owner: 10Msz2001) [10:22:03] (03Merged) 10jenkins-bot: Backport ReassignMenteesJob-related changes [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077658 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [10:22:23] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077658|Backport ReassignMenteesJob-related changes (T376124)]] [10:22:26] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [10:24:38] (03PS2) 10Elukey: sre.hosts.provision: explicitly disable DHCP for Supermicro [cookbooks] - 
10https://gerrit.wikimedia.org/r/1077660 [10:25:15] 06SRE, 06Infrastructure-Foundations, 10netops: Add link from cloudsw1-e4-eqiad to cloudsw1-f4-eiqad - https://phabricator.wikimedia.org/T372061#10199000 (10cmooney) With the gnmi stats in place we see fairly consistent drops on these links from cloudsw1-d5-eqiad: https://grafana-rw.wikimedia.org/d/5p97dAASz... [10:25:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:25:55] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:27:04] (03CR) 10Ayounsi: [C:03+1] sre.hosts.provision: explicitly disable DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1077660 (owner: 10Elukey) [10:28:08] (03PS3) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [10:29:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:29:18] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077658|Backport ReassignMenteesJob-related changes (T376124)]] (duration: 06m 54s) [10:29:21] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [10:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T367856)', diff saved to https://phabricator.wikimedia.org/P69453 and previous config saved to /var/cache/conftool/dbconfig/20241003-103001-ladsgroup.json 
[10:30:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:30:14] (03PS1) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [10:30:55] (03PS4) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [10:31:54] (03PS1) 10Slyngshede: IRCStream: Failover to CODFW. [dns] - 10https://gerrit.wikimedia.org/r/1077662 [10:40:38] (03CR) 10CI reject: [V:04-1] redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:45:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P69454 and previous config saved to /var/cache/conftool/dbconfig/20241003-104508-ladsgroup.json [10:45:34] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: explicitly disable DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1077660 (owner: 10Elukey) [10:46:52] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [10:47:58] (03PS2) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [10:53:33] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [10:59:50] (03PS5) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [11:00:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P69456 and previous config saved to 
/var/cache/conftool/dbconfig/20241003-110015-ladsgroup.json [11:01:01] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [11:15:08] (03PS2) 10Ammarpad: logos: Sync config.yaml and logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) [11:15:08] (03PS2) 10Ammarpad: hawiki: Add temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077400 (https://phabricator.wikimedia.org/T376049) [11:15:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T367856)', diff saved to https://phabricator.wikimedia.org/P69457 and previous config saved to /var/cache/conftool/dbconfig/20241003-111522-ladsgroup.json [11:15:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:15:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:15:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:15:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T367856)', diff saved to https://phabricator.wikimedia.org/P69458 and previous config saved to /var/cache/conftool/dbconfig/20241003-111544-ladsgroup.json [11:20:44] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10199081 (10cmooney) >>! In T374587#10160970, @ayounsi wrote: > It would indeed be great to have redundancy for the `fmsw`, but as that device... 
[11:21:29] (03PS1) 10Urbanecm: ReassignMenteesJob: Do not schedule follow-up jobs when first job fails [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077673 (https://phabricator.wikimedia.org/T376124) [11:25:01] (03CR) 10Urbanecm: [C:03+2] "needs to be deployed ASAP, cf T376124#10199089" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077673 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [11:25:13] (03PS1) 10Arturo Borrero Gonzalez: cloud: set profile::resolving::timeout to 5 [puppet] - 10https://gerrit.wikimedia.org/r/1077675 (https://phabricator.wikimedia.org/T374830) [11:29:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077673 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [11:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:35:30] (03PS7) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [11:35:30] (03PS8) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [11:35:30] (03PS4) 10Btullis: Absent the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [11:36:04] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) 
(owner: 10Btullis) [11:36:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:37:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4195/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [11:47:19] (03PS8) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [11:47:19] (03PS9) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [11:47:19] (03PS5) 10Btullis: Absent the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [11:47:50] (03CR) 10CI reject: [V:04-1] Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [11:57:20] (03CR) 10Elukey: [C:03+1] IRCStream: Failover to CODFW. 
[dns] - 10https://gerrit.wikimedia.org/r/1077662 (owner: 10Slyngshede) [12:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1200) [12:00:18] (03PS1) 10Hnowlan: php-cli: include mercurius in 8.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) [12:02:23] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:04:37] (03PS9) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [12:04:37] (03PS10) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [12:04:37] (03PS6) 10Btullis: Absent the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [12:04:39] (03PS1) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:05:07] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:05:22] (03Merged) 10jenkins-bot: ReassignMenteesJob: Do not schedule follow-up jobs when first job fails [extensions/GrowthExperiments] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077673 (https://phabricator.wikimedia.org/T376124) (owner: 10Urbanecm) [12:05:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:05:37] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077673|ReassignMenteesJob: Do 
not schedule follow-up jobs when first job fails (T376124)]] [12:05:49] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [12:06:38] (03PS2) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:06:57] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:07:32] (03CR) 10Slyngshede: [C:03+2] IRCStream: Failover to CODFW. [dns] - 10https://gerrit.wikimedia.org/r/1077662 (owner: 10Slyngshede) [12:09:11] (03PS3) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:09:31] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:09:47] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:10:12] (03PS4) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:10:31] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:12:06] (03PS5) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:12:25] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 
(https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:13:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:13:39] !log urbanecm@deploy2002 scap failed: local variable 'e' referenced before assignment (scap version: 4.108.0-1) (duration: 08m 02s) [12:13:59] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1077673|ReassignMenteesJob: Do not schedule follow-up jobs when first job fails (T376124)]] [12:14:01] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [12:14:03] (03PS6) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:14:22] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:20:06] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077698 [12:20:47] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077673|ReassignMenteesJob: Do not schedule follow-up jobs when first job fails (T376124)]] (duration: 06m 47s) [12:20:50] T376124: Removing a mentor from the list of mentors does not always reassign newcomers - https://phabricator.wikimedia.org/T376124 [12:23:18] (03PS7) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:23:39] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:23:59] 06SRE, 06Infrastructure-Foundations, 
06Traffic, 13Patch-For-Review: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10199200 (10cmooney) >>! In T376291#10198467, @ssingh wrote: > You are basing `dns_k8s_reverse_delegation` on `hieradata/common... [12:25:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:26:58] (03PS8) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:27:18] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:28:03] (03PS9) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:28:30] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:30:47] (03CR) 10Ladsgroup: WIP: Migrate wikitech dumps to snapshot servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:31:54] (03CR) 10Ladsgroup: WIP: Migrate wikitech dumps to snapshot servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:32:30] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:34:54] (03Abandoned) 10Arturo Borrero Gonzalez: cloud: set profile::resolving::timeout to 5 [puppet] - 10https://gerrit.wikimedia.org/r/1077675 
(https://phabricator.wikimedia.org/T374830) (owner: 10Arturo Borrero Gonzalez) [12:38:51] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10199231 (10aborrero) [12:41:28] (03PS10) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:41:31] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10199227 (10aborrero) 05Open→03Resolved [12:41:37] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10199229 (10aborrero) [12:42:25] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10199233 (10aborrero) please @Jhancock.wm try again with this one after the patch I merged yesterday. 
[12:44:56] (03CR) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:45:23] (03PS11) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:45:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:45:42] (03CR) 10CI reject: [V:04-1] WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:48:21] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10199236 (10aborrero) Created: * https://netbox.wikimedia.org/ipam/prefixes/1085/ * https://netbox.wikimedia.org/ipam/prefixes/1086/ * https://netbox.... 
[12:52:30] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:53:44] (03PS12) 10Effie Mouzeli: WIP: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [12:54:39] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [12:55:41] (03PS1) 10Aklapper: Phabricator: Make Popen constructor in phab_epipe.py return strings [puppet] - 10https://gerrit.wikimedia.org/r/1077709 (https://phabricator.wikimedia.org/T356077) [12:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:57:58] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:59:37] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: codfw: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376381 (10elukey) 03NEW [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1300). Please do the needful. [13:00:05] cscott, msz2001, Ammar, and Ammar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:19] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host irc2004.wikimedia.org [13:00:21] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [13:00:38] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: codfw: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376381#10199296 (10elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +-------+--... [13:02:08] (03PS1) 10Vgutierrez: hiera: Switch to digicert-2024 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) [13:03:28] (03PS2) 10Vgutierrez: hiera: Switch to digicert-2024 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) [13:06:19] (03CR) 10Btullis: "I added some validation to the path parameter, so that it ensures it is an absolute posix path name and does not end with a slash." [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [13:06:24] (03PS13) 10Effie Mouzeli: modules::snapshot: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) [13:08:35] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2004.wikimedia.org - elukey@cumin1002" [13:09:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2004.wikimedia.org - elukey@cumin1002" [13:09:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:09:07] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache irc2004.wikimedia.org on all recursors [13:09:10] !log elukey@cumin1002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) irc2004.wikimedia.org on all recursors [13:09:37] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc2004.wikimedia.org - elukey@cumin1002" [13:09:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc2004.wikimedia.org - elukey@cumin1002" [13:09:43] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add wan IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [13:10:01] (03CR) 10CI reject: [V:04-1] cloudgw: add wan IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [13:10:07] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add wan IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [13:10:26] (03CR) 10CI reject: [V:04-1] cloudgw: add wan IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [13:10:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [13:10:34] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host irc2004.wikimedia.org with OS bookworm [13:12:48] (03CR) 10Vgutierrez: [C:04-2] "Do not merge till 2024-10-07" [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [13:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:40] (03PS3) 10Kamila Součková: analytics_privatedata_users: add seanleong-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) [13:16:58] (03CR) 10Kamila Součková: analytics_privatedata_users: add seanleong-wmde (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [13:23:11] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on irc2004.wikimedia.org with reason: host reimage [13:23:33] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10199351 (10cmooney) I had a quick stab at creating the data for the `dns_reverse_zones.yaml` file from the dns repo and it's f... [13:23:44] I'm here, late for the backport window [13:23:53] But it doesn't seem to be happening? [13:24:01] It doesn't seem [13:26:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10199354 (10kamila) @Dzahn it appears that Sean is already in the LDAP groups, unless I'm misunderstanding something: ` kamil... 
[13:26:07] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [13:26:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc2004.wikimedia.org with reason: host reimage [13:26:55] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:09] (03CR) 10Ssingh: [C:03+1] durum: Remove rsa-2048 certs from nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1075613 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [13:28:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10199356 (10kamila) [13:28:35] (03CR) 10Ssingh: [C:03+1] hiera: Switch to digicert-2024 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [13:28:51] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:30:57] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [13:31:17] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [13:32:08] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [13:34:46] 06SRE, 06Infrastructure-Foundations, 06serviceops: Timeout while retrieving the catalog from 
the Docker Registry - https://phabricator.wikimedia.org/T376285#10199389 (10elukey) Current setup: * registry100* hosts using inmemory blobdescriptor cache * registry200* hosts using redis blobdescription cache The... [13:36:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10199363 (10kamila) @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl @Ottomata could one of you approve Sean's access request p... [13:37:07] (03PS1) 10Giuseppe Lavagetto: puppetserver: add conftool2git public key [puppet] - 10https://gerrit.wikimedia.org/r/1077716 [13:37:34] (03CR) 10Giuseppe Lavagetto: [C:03+2] puppetserver: add conftool2git public key [puppet] - 10https://gerrit.wikimedia.org/r/1077716 (owner: 10Giuseppe Lavagetto) [13:38:15] (03PS1) 10Brouberol: global_config: register an external service for each airflow host [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) [13:39:10] (03PS2) 10Brouberol: global_config: register an external service for each airflow host [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) [13:39:41] (03CR) 10Xcollazo: "Why would wikitech (aka labswiki) be dumped as part of the miscellaneous dumps, and not as part of the XML dumps?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [13:40:04] (03PS3) 10Brouberol: global_config: register an external service for each airflow host [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) [13:40:28] (03CR) 10JHathaway: redfish: add UEFI functions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [13:40:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host irc2004.wikimedia.org with OS bookworm [13:40:32] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc2004.wikimedia.org [13:40:33] (03PS3) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [13:41:10] (03CR) 10Ladsgroup: "This powers wikitech-static.wikimedia.org and needs to be done daily with different systems and requirements. We should eventually move th" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [13:42:08] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4200/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) (owner: 10Brouberol) [13:42:40] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host irc2004.wikimedia.org [13:44:07] (03CR) 10Xcollazo: "Can we please associate this change to a phab ticket?" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [13:45:01] (03CR) 10Ladsgroup: "It is associated to a phab ticket." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [13:45:23] 06SRE, 06DBA, 10Sustainability (Incident Followup), 07Wikimedia-production-error: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10199440 (10ABran-WMF) [13:45:25] FIRING: SystemdUnitFailed: conftool2git.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:27] (03CR) 10Ladsgroup: "It's not related to that at all." [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [13:45:37] (03CR) 10Bking: [C:03+1] global_config: register an external service for each airflow host [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) (owner: 10Brouberol) [13:46:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2004.wikimedia.org [13:48:02] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: register an external service for each airflow host [puppet] - 10https://gerrit.wikimedia.org/r/1077717 (https://phabricator.wikimedia.org/T376385) (owner: 10Brouberol) [13:50:11] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:51:12] (03CR) 10Brouberol: dse-k8s: Add service configuration for airflow-analytics-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:52:07] (03PS1) 10Klausman: hiera: add pseudosecret for S3 access from ml-lab machines [labs/private] - 10https://gerrit.wikimedia.org/r/1077720 [13:52:11] (03CR) 10Klausman: [V:03+2 C:03+2] "check 
experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1077720 (owner: 10Klausman) [13:52:39] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: codfw: 1 VM for ircstream-sse - https://phabricator.wikimedia.org/T376381#10199454 (10elukey) 05Open→03Resolved a:03elukey irc2004.codfw.wmnet up and running! [13:54:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:55:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:55:43] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10199481 (10kamila) p:05Triage→03Medium [13:57:21] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10199496 (10Jhancock.wm) [13:59:53] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10199502 (10phaultfinder) [14:02:13] (03PS1) 10Giuseppe Lavagetto: conftool2git: create user home directory [puppet] - 10https://gerrit.wikimedia.org/r/1077722 [14:02:30] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10199506 (10Jhancock.wm) @jijiki hi, we got the servers in this week and are going to be racking them today. Could you update the operations and puppet repo for us? I'm hoping to get them ins... [14:04:39] (03PS4) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [14:05:03] (03CR) 10Ayounsi: redfish: add UEFI functions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [14:06:04] (03CR) 10Xcollazo: "I do not think we should be adding more dumps to the Dumps 1.0 infra." [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:07:45] (03CR) 10Ladsgroup: "To explain in more depth, this doesn't add any new dumps. 
It moves the dump that was made daily on wikitech.wikimedia.org/dumps/ (via http" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:08:27] (03PS2) 10Hnowlan: php-cli: include mercurius in 8.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) [14:08:30] (03PS2) 10Giuseppe Lavagetto: conftool2git: create user home directory [puppet] - 10https://gerrit.wikimedia.org/r/1077722 [14:11:08] (03PS2) 10Ayounsi: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) [14:13:00] (03PS3) 10Giuseppe Lavagetto: conftool2git: create user home directory [puppet] - 10https://gerrit.wikimedia.org/r/1077722 [14:14:27] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10199522 (10Jhancock.wm) a:03Jhancock.wm [14:15:25] RESOLVED: SystemdUnitFailed: conftool2git.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:18] (03CR) 10Giuseppe Lavagetto: [C:03+1] "I understand in principle the concerns about maintainability of dumps 1.0, but this dump *was already* part of that infra, just ran on a d" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:19:18] (03CR) 10Arnaudb: "ditto!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [14:20:06] (03CR) 10Elukey: [C:03+1] conftool2git: create user home directory [puppet] - 10https://gerrit.wikimedia.org/r/1077722 (owner: 10Giuseppe Lavagetto) [14:20:58] (03CR) 10Giuseppe Lavagetto: [C:03+2] conftool2git: create user home directory [puppet] - 10https://gerrit.wikimedia.org/r/1077722 (owner: 10Giuseppe Lavagetto) [14:21:58] (03PS1) 10Elukey: Revert "IRCStream: Failover to CODFW." [dns] - 10https://gerrit.wikimedia.org/r/1077723 [14:22:42] (03CR) 10Slyngshede: [C:03+1] Revert "IRCStream: Failover to CODFW." [dns] - 10https://gerrit.wikimedia.org/r/1077723 (owner: 10Elukey) [14:22:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10199540 (10Jclark-ctr) @VRiley-WMF D5 is wmcs rack? netbox says D6 [14:23:06] (03CR) 10Elukey: [C:03+2] Revert "IRCStream: Failover to CODFW." [dns] - 10https://gerrit.wikimedia.org/r/1077723 (owner: 10Elukey) [14:23:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:23:39] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199548 (10MBywater-WMF) Hi all, thanks for your help on this! Feel free to ping me if you need any assistance from ITS. [14:23:59] (03CR) 10Xcollazo: "Ok, so if this breaks, the patch owner will fix it, yes?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:25:33] (03PS1) 10Elukey: Add ircstream-sse.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1077724 (https://phabricator.wikimedia.org/T376014) [14:26:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt aqs1022 - jclark@cumin1002" [14:28:04] (03CR) 10Slyngshede: [C:03+1] Add ircstream-sse.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1077724 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:28:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt aqs1022 - jclark@cumin1002" [14:28:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:48] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aqs1022 [14:29:07] !log jclark@cumin1002 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host aqs1022 [14:29:11] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199563 (10nisrael) For all contributors to this group I do want to stress that this is an urgent issue. We cannot have Lisa receiving donor responses to... [14:29:17] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aqs1022 [14:30:16] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199581 (10jhathaway) @nisrael would it be possible to provide an example raw message, including headers? 
[14:30:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aqs1022 [14:31:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aqs1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:34:52] (03PS14) 10Ladsgroup: modules::snapshot: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:35:56] (03CR) 10Ladsgroup: modules::snapshot: Migrate wikitech dumps to snapshot servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:36:08] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199623 (10nisrael) Do you mean an example of one of the responses she's been receiving? 
[14:37:58] FIRING: [8x] CertAlmostExpired: Certificate for service kubestagemaster2003:6443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:39:46] (03PS15) 10Ladsgroup: modules::snapshot: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:39:53] (03CR) 10Ladsgroup: [V:03+2 C:03+2] modules::snapshot: Migrate wikitech dumps to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1077684 (https://phabricator.wikimedia.org/T374114) (owner: 10Effie Mouzeli) [14:41:00] (03PS1) 10JHathaway: remove dev gems from default gem set [puppet] - 10https://gerrit.wikimedia.org/r/1077725 [14:41:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077725 (owner: 10JHathaway) [14:42:26] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [14:42:53] (03CR) 10Elukey: [C:03+2] Add ircstream-sse.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1077724 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:43:49] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [14:46:04] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [14:46:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [14:50:25] (03CR) 10EoghanGaffney: [C:03+2] apt-staging: Remove 'staging' flag from distribution name [puppet] - 10https://gerrit.wikimedia.org/r/1069991 (owner: 10EoghanGaffney) [14:52:22] (03PS1) 10Giuseppe Lavagetto: 
conftool2git: only run on the active host [puppet] - 10https://gerrit.wikimedia.org/r/1077728 [14:52:29] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@b715af7]: Deploy latest DAGs to the analytics Airflow instance. T373694. T375402 [14:52:36] T373694: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support data rewrite - https://phabricator.wikimedia.org/T373694 [14:52:36] T375402: Tune Dumps 2.0 hourly ingestion jobs - https://phabricator.wikimedia.org/T375402 [14:54:25] (03CR) 10Giuseppe Lavagetto: [C:03+2] conftool2git: only run on the active host [puppet] - 10https://gerrit.wikimedia.org/r/1077728 (owner: 10Giuseppe Lavagetto) [14:56:02] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@b715af7]: Deploy latest DAGs to the analytics Airflow instance. T373694. T375402 (duration: 03m 33s) [14:57:54] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199728 (10jhathaway) >>! In T375643#10199623, @nisrael wrote: > By this, do you mean an example of one of the responses she's been receiving? 
yes, exactly [14:58:23] (03PS1) 10Giuseppe Lavagetto: conftool2git: fix parameter for systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/1077730 [14:58:38] (03CR) 10Ssingh: [C:03+2] durum: Remove rsa-2048 certs from nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1075613 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [14:58:49] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] conftool2git: fix parameter for systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/1077730 (owner: 10Giuseppe Lavagetto) [15:00:05] hashar and brennen: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1500) [15:00:41] !log ongoing Junos upgrade on mr1-codfw [15:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:39] I have moved the train log triage ahead in time since we have another meeting starting now [15:03:40] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10199771 (10MBywater-WMF) @nisrael Here are some instructions on how to get email headers: https://support.google.com/mail/answer/29436?hl=en [15:07:59] (03CR) 10Btullis: [C:03+1] "This looks fine to me, assuming the rebase is clean." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077440 (owner: 10Majavah) [15:09:13] (03PS1) 10Ladsgroup: wikitechdumps: Fix path to the exec file [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) [15:10:16] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [15:11:37] (03PS2) 10Ladsgroup: wikitechdumps: Fix path to the exec file [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) [15:11:47] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [15:11:50] (03PS1) 10Amire80: Update jquery.ime from upstream [extensions/UniversalLanguageSelector] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077734 [15:12:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/UniversalLanguageSelector] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077734 (owner: 10Amire80) [15:13:36] (03PS1) 10Giuseppe Lavagetto: conftool2git: fix the pdb query [puppet] - 10https://gerrit.wikimedia.org/r/1077735 [15:13:39] (03CR) 10Ladsgroup: [C:03+2] wikitechdumps: Fix path to the exec file [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [15:13:48] (03PS3) 10Hnowlan: php-cli: include mercurius in 8.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) [15:13:48] (03CR) 10Effie Mouzeli: [C:03+1] wikitechdumps: Fix path to the exec file [puppet] - 10https://gerrit.wikimedia.org/r/1077733 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [15:15:43] (03PS4) 10Scott French: types: extend 
Profile::Mediawiki_deployment [puppet] - 10https://gerrit.wikimedia.org/r/1077479 (https://phabricator.wikimedia.org/T370934) [15:15:43] (03CR) 10Scott French: "Many thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1077479 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [15:15:56] (03PS5) 10Scott French: hieradata: use 'releases' in mw-debug mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1077480 (https://phabricator.wikimedia.org/T370934) [15:16:23] (03PS5) 10Scott French: hieradata: add mw-debug "next" release to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1077481 (https://phabricator.wikimedia.org/T372604) [15:16:58] (03CR) 10Giuseppe Lavagetto: [C:03+2] conftool2git: fix the pdb query [puppet] - 10https://gerrit.wikimedia.org/r/1077735 (owner: 10Giuseppe Lavagetto) [15:20:02] (03PS1) 10Ladsgroup: wikitechdumps: Update the path to mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077736 (https://phabricator.wikimedia.org/T374114) [15:20:47] (03CR) 10Ladsgroup: [C:03+2] wikitechdumps: Update the path to mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077736 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [15:22:17] (03PS1) 10Effie Mouzeli: wikitechdumps: Fix path to the mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077737 [15:22:27] (03CR) 10CI reject: [V:04-1] wikitechdumps: Fix path to the mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077737 (owner: 10Effie Mouzeli) [15:23:25] (03Abandoned) 10Effie Mouzeli: wikitechdumps: Fix path to the mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077737 (owner: 10Effie Mouzeli) [15:24:04] (03CR) 10Ssingh: [C:03+1] Delegate Kubernetes POD IP reverse ranges to k8s control-plane nodes [dns] - 10https://gerrit.wikimedia.org/r/1077486 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [15:26:12] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:08] (03PS1) 10JHathaway: WIP: add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 [15:31:17] 06SRE, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291#10199890 (10ssingh) >>! In T376291#10199200, @cmooney wrote: >>>! In T376291#10198467, @ssingh wrote: >> You are basing `dns_k8... [15:31:30] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10199892 (10elukey) Fourth day's summary: * Created irc2004.codfw.wmnet so we have a fai... [15:33:53] (03PS1) 10Ladsgroup: Update URL to wikitech dumps [wikitech-static] - 10https://gerrit.wikimedia.org/r/1077741 (https://phabricator.wikimedia.org/T374114) [15:35:19] (03PS2) 10Cathal Mooney: Delegate Kubernetes POD IP reverse ranges to k8s control-plane nodes [dns] - 10https://gerrit.wikimedia.org/r/1077486 (https://phabricator.wikimedia.org/T376291) [15:36:12] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:26] !log Junos upgrade on mr1-codfw complete [15:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:55] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10199906 (10Papaul) [15:40:14] (03PS1) 10Jforrester: mediawiki: Add WikiLambda's tables to the catalogue 
[puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) [15:41:12] (03PS1) 10Ladsgroup: wikitech: Get rid of the old mw-xml dumper file and cron [puppet] - 10https://gerrit.wikimedia.org/r/1077744 (https://phabricator.wikimedia.org/T374114) [15:41:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10199912 (10VRiley-WMF) @Jclark-ctr You're right. I had a typo. It is in Rack D6 as per the request "Row D, Preferred rack (in order): [D5, D6, D8, D2, D7, D4]" [15:41:59] (03CR) 10CI reject: [V:04-1] mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [15:42:58] (03PS4) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:45:01] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:45:02] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:45:08] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [15:45:43] (03PS5) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:45:57] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards 
[15:45:58] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:46:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:45] (03PS6) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:46:53] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:46:54] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:46:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:47:25] (03PS7) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:47:32] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:47:33] !log ryankemper@cumin2002 END (FAIL) - 
Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards [15:47:52] (03PS2) 10Jforrester: mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) [15:49:36] (03CR) 10CI reject: [V:04-1] mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [15:49:57] (03CR) 10Cathal Mooney: [C:03+2] Delegate Kubernetes POD IP reverse ranges to k8s control-plane nodes [dns] - 10https://gerrit.wikimedia.org/r/1077486 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [15:50:29] !log merging patch to add k8s pod IP range reverse delegations to dns T376291 [15:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:32] T376291: Delegate reverse DNS zones for k8s pod IP ranges on authdns servers - https://phabricator.wikimedia.org/T376291 [15:51:25] (03CR) 10Ladsgroup: "It's not valid yaml it seems. But also it's missing a dblist (see translate or flaggedrevs). 
We don't have a wikifunctions dblist but mayb" [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [15:51:29] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:51:31] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:51:32] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [15:52:12] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:52:14] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:52:31] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10199939 (10Seddon) [15:52:37] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10199940 (10Seddon) Updated the task. Approved as manager. 
[15:52:59] (03PS8) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:53:08] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:53:50] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 59.75.192.10.in-addr.arpa on all recursors [15:53:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 59.75.192.10.in-addr.arpa on all recursors [15:57:05] (03PS9) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [15:58:03] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards [15:58:06] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077 [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:30] \o/ [16:01:34] {◕ ◡ ◕} [16:02:36] (03CR) 10Effie Mouzeli: [C:03+2] Update URL to wikitech dumps [wikitech-static] - 10https://gerrit.wikimedia.org/r/1077741 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:02:43] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] Update URL to wikitech dumps [wikitech-static] - 10https://gerrit.wikimedia.org/r/1077741 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:02:51] 😌 [16:09:03] (03PS2) 10Ladsgroup: dumps: Stop fetching custom Wikitech dumps [puppet] - 10https://gerrit.wikimedia.org/r/1077440 (https://phabricator.wikimedia.org/T374114) (owner: 10Majavah) [16:09:30] (03CR) 10CI reject: [V:04-1] wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:11:13] (03PS2) 10Effie Mouzeli: wikitech: Get rid of the old mw-xml dumper file and cron [puppet] - 10https://gerrit.wikimedia.org/r/1077744 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:11:30] (03CR) 10Effie Mouzeli: [C:03+1] wikitech: Get rid of the old mw-xml dumper file and cron [puppet] - 10https://gerrit.wikimedia.org/r/1077744 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:13:12] (03PS10) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [16:14:03] (03CR) 10Effie Mouzeli: [C:03+1] dumps: Stop fetching custom Wikitech dumps [puppet] - 10https://gerrit.wikimedia.org/r/1077440 (https://phabricator.wikimedia.org/T374114) (owner: 10Majavah) [16:14:29] (03CR) 10Ladsgroup: [C:03+2] dumps: Stop fetching custom Wikitech dumps [puppet] - 10https://gerrit.wikimedia.org/r/1077440 (https://phabricator.wikimedia.org/T374114) (owner: 10Majavah) [16:15:12] (03PS3) 10Effie Mouzeli: wikitech: Get rid of the old mw-xml dumper file and cron [puppet] - 
10https://gerrit.wikimedia.org/r/1077744 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:15:22] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wikitech: Get rid of the old mw-xml dumper file and cron [puppet] - 10https://gerrit.wikimedia.org/r/1077744 (https://phabricator.wikimedia.org/T374114) (owner: 10Ladsgroup) [16:17:34] (03PS11) 10Ryan Kemper: wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) [16:19:39] (03CR) 10Jforrester: "Ah, global_block_whitelist doesn't have a dblist, which is what I was copying." [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [16:22:52] (03PS3) 10RLazarus: scap: Add a deprecation warning to classic mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) [16:24:15] (03CR) 10Scott French: [C:03+2] types: extend Profile::Mediawiki_deployment [puppet] - 10https://gerrit.wikimedia.org/r/1077479 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [16:24:39] (03PS3) 10Jforrester: mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) [16:25:01] (03PS4) 10Jforrester: mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) [16:25:42] (03CR) 10RLazarus: scap: Add a deprecation warning to classic mwscript (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [16:26:08] (03PS3) 10Ayounsi: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) [16:30:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [16:30:54] (03CR) 10Giuseppe Lavagetto: [C:03+1] scap: Add a deprecation warning to classic mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [16:31:13] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:31:31] (03CR) 10Ryan Kemper: "Tested and working" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:31:33] (03CR) 10Ryan Kemper: [C:03+2] wdqs.data-transfer: refuse xfer on differing jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/1077059 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [16:50:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [16:56:11] (03CR) 10Dzahn: [C:03+2] "ah, it's about utf-8 encoding! Yea, let's try it. No risk since it's currently not working." 
[puppet] - 10https://gerrit.wikimedia.org/r/1077709 (https://phabricator.wikimedia.org/T356077) (owner: 10Aklapper) [16:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:59:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1700). [17:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1700). 
[17:00:44] here, and will get started shortly [17:01:52] (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1077753 [17:02:29] (03CR) 10Scott French: [C:03+2] hieradata: use 'releases' in mw-debug mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1077480 (https://phabricator.wikimedia.org/T370934) (owner: 10Scott French) [17:02:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [17:07:39] (03PS9) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [17:09:58] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [17:11:07] !log swfrench@deploy2002 Started scap sync-world: Testing after mediawiki-deployments.yaml format change - T370934 [17:11:10] T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions - https://phabricator.wikimedia.org/T370934 [17:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:57] !log swfrench@deploy2002 Finished scap sync-world: Testing after mediawiki-deployments.yaml format change - T370934 (duration: 02m 50s) [17:17:14] (03PS5) 10Pppery: Missing.php: Improve detection of interwikis in 
certain cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075957 (https://phabricator.wikimedia.org/T363538) [17:20:34] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200317 (10nisrael) @jhathaway I can attempt to do this, but I don't have access to this inbox and it's getting a bit techy. It may take me a bit to get... [17:22:06] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10200328 (10BCornwall) 05Open→03Resolved a:03BCornwall [17:24:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:28:08] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200360 (10Dzahn) For the record, the redirect still exists as it did in the past. Our MX server exim alias file for wikipedia.org has ` 45 # Lisa -... 
[17:29:10] alright, I believe I'm done with the window [17:29:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:41:29] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [17:41:37] !log codesearch was broken - VM was down - rebooted - restarting all the indices is a bit slow but mostly back up now [17:41:37] (03PS5) 10Jforrester: mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) [17:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:39] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Add WikiLambda's tables to the catalogue [puppet] - 10https://gerrit.wikimedia.org/r/1077743 (https://phabricator.wikimedia.org/T363581) (owner: 10Jforrester) [17:45:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200444 (10Dzahn) Is it possible the outgoing fundraising mails have a ` Reply-To:`-header of lisa@wikimedia.org, maybe through a typo somewhere? 
[17:57:42] (03CR) 10RLazarus: [C:03+2] scap: Add a deprecation warning to classic mwscript [puppet] - 10https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T1800) [18:00:15] !log codesearch - ran out of disk due to 11G /var/log/account/pacct file - manually ran /etc/cron.daily/acct to rotate it, then deleted old file, back to 39% disk usage [18:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:07] (nothing for this train window.) [18:04:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10200522 (10phaultfinder) [18:11:45] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200559 (10jhathaway) Our logs on our inbound postfix servers show the alias being applied correctly as well: ` 2024-10-03T13:42:05.863401+00:00 mx-in10... [18:15:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200572 (10jhathaway) >>! In T375643#10200317, @nisrael wrote: > @jhathaway I can attempt to do this, but I don't have access to this inbox and it's gett... 
[18:28:37] !log depool dns1005 for all services for testing T344171 [18:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:40] T344171: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171 [18:31:34] 06SRE, 10conftool: conftool socket timeout on IRC logging - https://phabricator.wikimedia.org/T376416 (10ssingh) 03NEW [18:31:52] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200651 (10Dzahn) Maybe you can also just send the fundraising email to our inboxes, like treat as if we were the normal recipients. [18:37:44] (03CR) 10Dzahn: [C:03+1] "looks good to me! (once it has approval)" [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [18:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:39:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10200668 (10Dzahn) >>! In T376034#10199354, @kamila wrote: > @Dzahn it appears that Sean is already in the LDAP groups, unles... 
[18:39:54] (03CR) 10Dzahn: ncredir: Add enwp.org/c.enwp.org redirection (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [18:43:26] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [18:43:47] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [18:51:25] (03PS1) 10Ryan Kemper: wdqs.categories-reload: don't check host [cookbooks] - 10https://gerrit.wikimedia.org/r/1077777 (https://phabricator.wikimedia.org/T375687) [18:51:26] !log cmooney@cumin1002 conftool action : set/pooled=yes; selector: name=dns1005.wikimedia.org [reason: testing T344171] [18:51:28] T344171: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171 [19:00:30] brennen: ok if i deploy a new scap release? [19:02:43] 06SRE, 06Infrastructure-Foundations, 06Traffic: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges - https://phabricator.wikimedia.org/T376291#10200713 (10cmooney) [19:04:59] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200718 (10nisrael) Oh sure I can do that! @Dzahn just sent you a test. Let me know if there's anyone else I should include. [19:05:43] think i'll go ahead with it. 
it contains a very small fix to the `build-images` subcommand [19:05:50] !log dduvall@deploy2002 Installing scap version "4.109.0" for 210 hosts [19:06:46] (03CR) 10Bking: [C:03+2] wdqs.categories-reload: don't check host [cookbooks] - 10https://gerrit.wikimedia.org/r/1077777 (https://phabricator.wikimedia.org/T375687) (owner: 10Ryan Kemper) [19:09:23] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [19:09:29] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [19:10:18] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10200735 (10cmooney) >>! In T374713#10199236, @aborrero wrote: > Created: Thanks! I've made some minor edits to them in Netbox btw, just some things... [19:13:40] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.148`. 
Pre-deploy tests passing on canary `wdqs1016` [19:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:49] (03PS15) 10BCornwall: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) [19:13:50] (03CR) 10BCornwall: varnish: Give 1% of views RSA cert warnings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:14:04] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@a3efe93]: 0.3.148 [19:14:35] !log [WDQS Deploy] Tests passing following deploy of `0.3.148` on canary `wdqs1016`; proceeding to rest of fleet [19:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:55] (03PS1) 10Dzahn: gerrit: make it possible to not bind the service IP on a gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) [19:15:14] (03CR) 10CI reject: [V:04-1] gerrit: make it possible to not bind the service IP on a gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [19:16:51] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [19:18:02] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [19:18:08] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [19:18:41] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1076167 (owner: 10Muehlenhoff) [19:18:48] (03CR) 10BCornwall: [V:03+1] "varnishtests passing" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:20:41] 06SRE, 
06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200750 (10jhathaway) please send me one as well, thanks [19:22:38] dduvall: sorry i missed yr ping - all good by me [19:22:46] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@a3efe93]: 0.3.148 (duration: 08m 42s) [19:23:44] (03PS2) 10Dzahn: gerrit: make it possible to not bind the service IP on a gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) [19:25:12] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [19:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:22] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [19:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:33] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@a3efe93] (wcqs): Deploy 0.3.148 to WCQS [19:25:38] (03CR) 10CI reject: [V:04-1] gerrit: make it possible to not bind the service IP on a gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [19:27:39] (03PS3) 10Dzahn: gerrit: make it possible to not bind the service IP on a gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) [19:28:35] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@a3efe93] (wcqs): Deploy 0.3.148 to WCQS (duration: 03m 02s) [19:35:31] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [19:36:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload 
(exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet [19:42:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1022.eqiad.wmnet with OS bullseye [19:42:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10200796 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye [19:46:07] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074495 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [19:46:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:47:08] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: add migration release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074495 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [19:48:19] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [19:49:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload (exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet [19:49:46] (03PS7) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 
(https://phabricator.wikimedia.org/T332220) [19:50:02] (03CR) 10BCornwall: ncredir: Add enwp.org/c.enwp.org redirection (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:50:46] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [19:50:52] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4210/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:51:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload (exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet [19:53:03] (03CR) 10Cwhite: [C:03+2] logstash: put logging-hd200[4-5] in service [puppet] - 10https://gerrit.wikimedia.org/r/1077498 (https://phabricator.wikimedia.org/T375447) (owner: 10Cwhite) [19:53:13] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [19:56:01] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T2000). [20:00:05] derenrich and aharoni: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:02:07] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [20:02:26] Hallo! I'm here for a backport. [20:02:48] o/ i can deploy today if the usual suspects are afk. 
[20:02:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload (exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet [20:03:02] derenrich: around? [20:04:07] (03CR) 10Dzahn: [C:03+1] ncredir: Add enwp.org/c.enwp.org redirection [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [20:04:07] (03CR) 10Brennen Bearnes: [C:03+2] Update jquery.ime from upstream [extensions/UniversalLanguageSelector] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077734 (owner: 10Amire80) [20:04:23] aharoni: will get yours going through CI [20:04:36] thanks <3 [20:10:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077734 (owner: 10Amire80) [20:12:09] (derenrich is deferring his patch 'til next week, so this is the only thing for the window at the moment.) [20:15:56] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200860 (10nisrael) @jhathaway done! [20:18:15] If I recall correctly, I'm supposed to test the deployment using the browser extension at some point. Which domain should I select there? Something with k8s or mwdebug? [20:23:32] i believe k8s-mwdebug will work at this point. (once synced.) [20:24:36] Cool. I'm patiently waiting for you to tell me to test. [20:27:03] yep, just waiting on CI. will ping. [20:29:17] brennen: is there room for another backport in this window? [20:29:47] cscott: probably if we get it going right now... 
[20:30:06] jouncebot nowandnext [20:30:06] For the next 0 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241003T2000) [20:30:06] In 9 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241004T0600) [20:30:19] i don't mind going a bit over on this window. [20:30:19] (03PS1) 10C. Scott Ananian: RefreshLinksJob: Fix exception due to null/false confusion (take 2) [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077792 [20:30:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077792 (owner: 10C. Scott Ananian) [20:31:22] (03CR) 10Brennen Bearnes: [C:03+2] RefreshLinksJob: Fix exception due to null/false confusion (take 2) [core] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077792 (owner: 10C. Scott Ananian) [20:32:01] brennen: https://gerrit.wikimedia.org/r/c/1077792/ i just added it to the wiki [20:32:11] yep, +2'd [20:32:15] you're so fast [20:32:20] heh, sometimes [20:33:47] (03CR) 10C. Scott Ananian: "Clearing my C-2 since the required dependency is being backported as I write this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: 10C. Scott Ananian) [20:34:43] (03Merged) 10jenkins-bot: Update jquery.ime from upstream [extensions/UniversalLanguageSelector] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1077734 (owner: 10Amire80) [20:34:59] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077734|Update jquery.ime from upstream]] [20:35:05] we could do the mediawiki-config tweak ^ above as well if there's time, or that can wait until monday. 
[20:35:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[20:36:40] cscott: let's see how we land with the 2 things currently in queue
[20:36:48] brennen my backport patch, to which you gave +2, is merged
[20:36:51] works for me.
[20:37:03] !log brennen@deploy2002 brennen, amire80: Backport for [[gerrit:1077734|Update jquery.ime from upstream]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:37:07] aharoni: yeah, it's syncing as we speak, should be ready for test... now
[20:37:51] brennen: i'm going to add the config change to the backport window on wiki, just so the commands etc are handy, but if we're short on time i'll just move it to monday.
[20:38:09] kk
[20:38:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: C. Scott Ananian)
[20:39:36] aharoni: let me know when you're good to continue.
[20:39:42] good to continue
[20:39:50] cool, going ahead
[20:39:52] !log brennen@deploy2002 brennen, amire80: Continuing with sync
[20:44:25] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077734|Update jquery.ime from upstream]] (duration: 09m 25s)
[20:46:29] cscott: i'll plan on doing the config change - is this something that could be checked and rolled out simultaneously with the core patch?
[20:47:07] without the core patch the metrics will trigger an exception in RefreshLinksJob when the page it is trying to fetch is not previously in the cache.
[20:47:23] So if "simultaneous" really means "simultaneous" then they could be rolled out together
[20:47:38] but if there are some wikis which could see the config change before the code change then they should probably be done sequentially.
[20:48:44] it wouldn't be a user-visible crash, since it is in refreshlinksjob, but in theory some metadata could fail to get updated during the brief period of time things aren't consistent. I could probably find the crashes in the logs and manually action=purge them... but it would be less work for me if i didn't have to. :)
[20:48:53] it'd be fine, i'm pretty sure, but from an abundance of not wanting to complicate anything, we can just do the core patch first. the config one should be pretty quick.
[20:50:26] (CR) TrainBranchBot: [C:+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.25) - https://gerrit.wikimedia.org/r/1077792 (owner: C. Scott Ananian)
[20:55:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[20:56:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1022.eqiad.wmnet with OS bullseye
[20:56:26] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10200914 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye executed with errors: - aqs1...
[20:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[21:00:33] (Merged) jenkins-bot: RefreshLinksJob: Fix exception due to null/false confusion (take 2) [core] (wmf/1.43.0-wmf.25) - https://gerrit.wikimedia.org/r/1077792 (owner: C. Scott Ananian)
[21:00:50] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077792|RefreshLinksJob: Fix exception due to null/false confusion (take 2)]]
[21:02:53] !log brennen@deploy2002 cscott, brennen: Backport for [[gerrit:1077792|RefreshLinksJob: Fix exception due to null/false confusion (take 2)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:03:31] I can purge some pages and watch to see that there aren't any crashes in the logs, but it's not much of a test.
[21:03:46] it wouldn't be crashing w/o the config change anyway
[21:04:32] cscott: i defer to you on confidence for going ahead. :)
[21:05:14] i'm going to look at the exception log just for due diligence
[21:05:25] kk
[21:06:12] i don't see anything, ok to go ahead
[21:06:29] !log brennen@deploy2002 cscott, brennen: Continuing with sync
[21:07:08] (CR) Brennen Bearnes: [C:+2] Turn on Parsoid Selective Update metrics [mediawiki-config] - https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: C. Scott Ananian)
[21:07:20] getting the config change going while we're waiting on this to finish.
[21:07:30] wfm
[21:07:52] (Merged) jenkins-bot: Turn on Parsoid Selective Update metrics [mediawiki-config] - https://gerrit.wikimedia.org/r/1077460 (https://phabricator.wikimedia.org/T371713) (owner: C.
Scott Ananian)
[21:11:00] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077792|RefreshLinksJob: Fix exception due to null/false confusion (take 2)]] (duration: 10m 09s)
[21:13:05] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]]
[21:13:07] T371713: Instrumentation & data gathering to inform future performance & templating improvements - https://phabricator.wikimedia.org/T371713
[21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:15:07] !log brennen@deploy2002 cscott, brennen: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:32] ok, i can watch to see the stats start showing up, hopefully
[21:18:06] cscott: just say when.
[21:21:34] i don't see stats showing up yet, but with a 1-in-1000 sampling rate and looking only at the testservers, i'm not sure how long i'd expect that to take before it shows up in thanos autocomplete.
[21:22:57] yeah, i guess just go ahead and you should see in prod pretty quickly?
[21:23:09] yeah, i checked the error logs and at least nothing is broken
[21:23:17] so go ahead i guess
[21:23:50] kk
[21:23:53] !log brennen@deploy2002 cscott, brennen: Continuing with sync
[21:25:27] it just showed up :)
[21:25:43] total count = 1
[21:26:07] cool
[21:26:16] now we're up to 3 events. yay.
[21:26:52] really what this means is that i can let it collect data over the weekend and come in to monday and start to analyze it. :)
[21:27:12] rate was deliberately set pretty low
[21:27:59] good deal.
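Note on the 1-in-1000 sampling rate discussed above: a rough way to estimate how long the first sampled event takes to appear is to treat each parse as an independent trial. The sketch below is illustrative only; the function names are hypothetical and the independence assumption is not taken from the log.

```python
import math

# The 1-in-1000 rate mentioned in the log; everything else here is an assumption.
SAMPLE_RATE = 1 / 1000


def prob_at_least_one(n_parses: int, rate: float = SAMPLE_RATE) -> float:
    """Probability that at least one of n_parses independent parses is sampled."""
    return 1.0 - (1.0 - rate) ** n_parses


def parses_needed(confidence: float, rate: float = SAMPLE_RATE) -> int:
    """Smallest n for which P(at least one sampled event) reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        print(f"{n:>5} parses -> P(>=1 event) = {prob_at_least_one(n):.3f}")
    print("parses for 95% confidence:", parses_needed(0.95))
```

On a handful of low-traffic test servers the required number of parses can take a while to accumulate, which fits with the first events only appearing once the sync continued to production.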
[21:28:05] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200971 (jhathaway) @nisrael there is nothing obvious that I see in the email that would indicate why replies are arriving at `lisa@wikimedia.org`. The...
[21:28:36] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]] (duration: 15m 30s)
[21:28:38] !log end of UTC late backport & config window
[21:28:38] T371713: Instrumentation & data gathering to inform future performance & templating improvements - https://phabricator.wikimedia.org/T371713
[21:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:21] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10200976 (cscott) People are perhaps manually cut-and-pasting the name from the "from" header, instead of using reply-to?
[21:29:39] brennen: thanks so much for going over the window for me
[21:30:08] happy to help. pretty quiet one today otherwise, so not exactly a high stress situation. :)
[21:32:49] (PS10) Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208)
[21:33:11] (CR) Bking: dse-k8s: Add service configuration for airflow-analytics-test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: Bking)
[21:38:09] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10201015 (jhathaway) >>! In T375643#10200976, @cscott wrote: > People are perhaps manually cut-and-pasting the name from the "from" header, instead of u...
[21:38:59] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet
[21:39:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload (exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet
[21:45:43] (PS1) Arlolra: scandium is being replaced by parsoidtest1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/1077800 (https://phabricator.wikimedia.org/T363402)
[22:07:38] cscott: hrm, i'm noticing some new `Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given` since that deploy.
[22:07:42] (roughly, anyway)
[22:08:11] huh, that's interesting.
[22:08:20] can you point me to a stack trace?
[22:08:25] yeah, one sec
[22:09:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201095 (phaultfinder)
[22:09:55] The cause is probably `$timeStat->incrementBy( $output->getTimeProfile( 'cpu' ) );` I hadn't considered that `getTimeProfile` could return null, but I'm guessing that's what's happening?
[22:10:06] cscott: filed T376433
[22:10:07] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433
[22:11:10] probably the quickest fix to deploy is to just back out the mediawiki-config change and turn the stats off?
[22:12:17] I probably should investigate and figure out what time parses aren't getting timed, and i'd probably want to fix that rather than distort the stats by not including times for "those parses" (whichever they are), and that doesn't seem like a quick fix.
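Note on the suspected cause above: the failure mode is a counter whose increment requires a float being handed a missing (null) timing. The Python model below only illustrates that pattern and one possible stopgap guard; `CounterMetric` and `record_cpu_time` here are hypothetical stand-ins, not the MediaWiki PHP classes, and the log does not say which fix was eventually chosen.

```python
from typing import Optional


class CounterMetric:
    """Hypothetical stand-in for a counter whose increment requires a number."""

    def __init__(self) -> None:
        self.value = 0.0

    def increment_by(self, amount: float) -> None:
        # Mirror the strict typing in the reported error: None is rejected.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            raise TypeError("increment_by() must be given a float")
        self.value += float(amount)


def record_cpu_time(counter: CounterMetric, cpu_time: Optional[float]) -> bool:
    """Increment only when a timing exists; return whether it was recorded."""
    if cpu_time is None:
        # No timing was captured for this parse, so skip it instead of crashing.
        return False
    counter.increment_by(cpu_time)
    return True
```

Skipping untimed parses avoids the exception but, as noted in the discussion, it also leaves those parses out of the statistics; the boolean return value would let a caller count how many were skipped.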
[22:12:36] yeah, a revert would be quick here
[22:12:40] for mw-config
[22:13:29] (PS1) Brennen Bearnes: Revert "Turn on Parsoid Selective Update metrics" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077802 (https://phabricator.wikimedia.org/T376433)
[22:13:37] cscott: ^
[22:14:09] (CR) C. Scott Ananian: [C:+1] Revert "Turn on Parsoid Selective Update metrics" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077802 (https://phabricator.wikimedia.org/T376433) (owner: Brennen Bearnes)
[22:14:17] looks good to me
[22:14:48] cool, going ahead with that
[22:14:51] (CR) TrainBranchBot: [C:+2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077802 (https://phabricator.wikimedia.org/T376433) (owner: Brennen Bearnes)
[22:15:24] noting that i see this both for PoolWorkArticleView.php and at a lower volume for ParserOutputAccess.php
[22:15:32] (Merged) jenkins-bot: Revert "Turn on Parsoid Selective Update metrics" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077802 (https://phabricator.wikimedia.org/T376433) (owner: Brennen Bearnes)
[22:15:50] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077802|Revert "Turn on Parsoid Selective Update metrics" (T376433)]]
[22:15:53] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433
[22:15:57] !log brennen@deploy2002 scap failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.43.0-wmf.25 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/m
[22:15:57] ediawiki-multiversion-debug --webserver-image-name
docker-registry.discovery.wmnet/restricted/mediawiki-webserver --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.109.0-1) (duration: 00m 07s)
[22:16:31] oh good
[22:18:22] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077802|Revert "Turn on Parsoid Selective Update metrics" (T376433)]]
[22:18:30] !log brennen@deploy2002 scap failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.43.0-wmf.25 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/m
[22:18:30] ediawiki-multiversion-debug --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.109.0-1) (duration: 00m 07s)
[22:23:00] (scap's having issues, working on that.)
[22:24:10] (PS1) Arlolra: scandium is being replaced by parsoidtest1001 [puppet] - https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402)
[22:25:31] in terms of severity, it looks like these are being generated by folks using the REST HTML endpoint against wikidata article pages, which I don't think is officially supported; the REST HTML endpoint is only supposed to be used for wikitext content model.
[22:26:11] so that explains the relatively low rate, and it should probably be considered not super high priority, although certainly worth a backport to fix.
[22:28:56] yeah, mostly would just like to clear it out in terms of log noise
[22:30:10] I look forward to learning what crazy thing Wikibase is doing in its implementation of ContentHandler::fillParserOutput()
[22:30:49] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1077802|Revert "Turn on Parsoid Selective Update metrics" (T376433)]]
[22:30:52] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433
[22:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 856.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:32:54] !log brennen@deploy2002 brennen: Backport for [[gerrit:1077802|Revert "Turn on Parsoid Selective Update metrics" (T376433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:33:17] !log brennen@deploy2002 brennen: Continuing with sync
[22:37:53] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077802|Revert "Turn on Parsoid Selective Update metrics" (T376433)]] (duration: 07m 04s)
[22:37:56] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433
[22:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:39:48] cscott: confirming that those errors stopped.
[22:46:20] thanks!
[22:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 827.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:48:03] (PS1) JHathaway: WIP: Don't send the dhcp file to the debian installer [software/spicerack] - https://gerrit.wikimedia.org/r/1077804
[22:53:11] (CR) JHathaway: redfish: add UEFI functions (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[22:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:05] (CR) JHathaway: redfish: add UEFI functions (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[22:57:53] (CR) CI reject: [V:-1] WIP: Don't send the dhcp file to the debian installer [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[23:06:04] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10201278 (Jclark-ctr)
[23:06:39] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10201279 (Jclark-ctr) a:Jclark-ctr
[23:08:02] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10201280 (Jclark-ctr) @Eevans Hey Eric. this server is failing.
I believe it might be because it is not insetup role in site.pp file. can you assist with that putti...
[23:30:07] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1077495 (owner: TrainBranchBot)
[23:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1077812
[23:46:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:46:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed