[00:09:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1077812 (owner: TrainBranchBot)
[00:24:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201362 (phaultfinder)
[00:40:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:58:44] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10201389 (Dzahn) @nisrael I think at this point we really need to see one of the mails that arrived at lisa@wikimedia.org to say anything more. Do you...
[01:00:35] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376192#10201390 (Dzahn)
[01:00:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:00:48] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376192#10201394 (Dzahn) →Duplicate dup:T376094
[01:01:59] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10201392 (Dzahn)
[01:36:14] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:29:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201433 (phaultfinder)
[02:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:46:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201471 (phaultfinder)
[04:07:44] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10201472 (seanleong-WMDE) Hi @kamila, yup, I have read the user responsibilities. Thank you @Dzahn and @KFrancis.
[04:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201473 (phaultfinder)
[04:44:52] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201476 (phaultfinder)
[04:45:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[04:47:57] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376445 (phaultfinder) NEW
[04:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:05:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:36:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes
(http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241004T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:34:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10201554 (phaultfinder)
[06:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:49:50] (CR) Arnaudb: "a few comments inlined, focused on TODOs" [cookbooks] - https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: Volans)
[06:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:57:54] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4211/" [puppet] - https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241004T0700)
[07:04:15] (CR) Brouberol: [C:+1] dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: Bking)
[07:04:54] !log import jenkins 2.462.3 to thirdparty/ci T376449
[07:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:53] (CR) Jelto: [V:+1 C:+1] "lgtm, gerrit or apache might fail if they cant bind to that address. But I just found `profile::gerrit::proxy` using the IPs.
And the vari" [puppet] - https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[07:30:15] !log upgrading Jenkins on CI Jenkins
[07:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:40] (CR) Giuseppe Lavagetto: [V:+1 C:+2] conftool::client: allow setting the conftool2git address [puppet] - https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: Giuseppe Lavagetto)
[07:46:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:29] !log oblivian@puppetserver1001 conftool action : set/weight=1; selector: dc=eqiad,cluster=kubernetes,name=mw1439.eqiad.wmnet
[07:51:44] !log oblivian@puppetserver1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=kubernetes,name=mw1439.eqiad.wmnet
[08:17:32] (PS1) Klausman: hiera: move S3 pseudo-secrets for ml-lab::gpu to the right place [labs/private] - https://gerrit.wikimedia.org/r/1077896
[08:35:26] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10201766 (Gehel) p:Triage→Medium
[08:43:20] (CR) Jelto: [C:+2] wikidata-query-gui: add new service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[08:45:07] (Merged) jenkins-bot: wikidata-query-gui: add new service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[08:50:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: -
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:55:27] (PS1) Slyngshede: C:ircstream add blackbox monitoring. [puppet] - https://gerrit.wikimedia.org/r/1077909 (https://phabricator.wikimedia.org/T376014)
[08:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:58:43] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[08:59:12] (PS1) Stevemunene: Change an-worker117[67] to use reuse partman recipe. [puppet] - https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788)
[09:02:09] (PS2) Slyngshede: C:ircstream add blackbox monitoring. [puppet] - https://gerrit.wikimedia.org/r/1077909 (https://phabricator.wikimedia.org/T376014)
[09:04:07] (PS1) Giuseppe Lavagetto: git::replicated_local_repo fixes [puppet] - https://gerrit.wikimedia.org/r/1077915
[09:04:26] (CR) Jelto: [C:+2] "Deploy is stuck because apache and envoy both default to port 8080 and envoy fails with: `cannot bind '[::]:8080': Address already in use`" [deployment-charts] - https://gerrit.wikimedia.org/r/1077322 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[09:04:45] (PS1) Klausman: admin/data: Add user group for access to ml-lab machines [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380)
[09:06:05] (PS2) Klausman: admin/data: Add user group for access to ml-lab machines [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380)
[09:06:48] (PS3) Klausman: admin/data: Add user group for access to ml-lab machines [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380)
[09:07:01] (CR) Elukey: [C:+1] C:ircstream add blackbox monitoring. [puppet] - https://gerrit.wikimedia.org/r/1077909 (https://phabricator.wikimedia.org/T376014) (owner: Slyngshede)
[09:09:14] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[09:10:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:11:38] (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[09:11:41] (CR) Slyngshede: [C:+2] C:ircstream add blackbox monitoring. [puppet] - https://gerrit.wikimedia.org/r/1077909 (https://phabricator.wikimedia.org/T376014) (owner: Slyngshede)
[09:12:26] (CR) Klausman: [C:+2] admin/data: Add user group for access to ml-lab machines [puppet] - https://gerrit.wikimedia.org/r/1077914 (https://phabricator.wikimedia.org/T376380) (owner: Klausman)
[09:14:58] (CR) Giuseppe Lavagetto: [C:+2] git::replicated_local_repo fixes [puppet] - https://gerrit.wikimedia.org/r/1077915 (owner: Giuseppe Lavagetto)
[09:15:25] FIRING: SystemdUnitFailed: prometheus-puppet-agent-stats.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:31] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database moswiki (T375568)
[09:15:34] T375568: Prepare and check storage layer for moswiki - https://phabricator.wikimedia.org/T375568
[09:15:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database moswiki (T375568)
[09:16:08] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database madwiktionary (T375023)
[09:16:11] T375023: Prepare and check
storage layer for madwiktionary - https://phabricator.wikimedia.org/T375023
[09:16:19] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database madwiktionary (T375023)
[09:16:53] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database gorwikiquote (T375094)
[09:16:55] T375094: Prepare and check storage layer for gorwikiquote - https://phabricator.wikimedia.org/T375094
[09:17:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database gorwikiquote (T375094)
[09:17:27] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database kgewiki (T374814)
[09:17:29] T374814: Prepare and check storage layer for kgewiki - https://phabricator.wikimedia.org/T374814
[09:17:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database kgewiki (T374814)
[09:18:08] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database shnwikinews (T375432)
[09:18:11] T375432: Prepare and check storage layer for shnwikinews - https://phabricator.wikimedia.org/T375432
[09:19:40] SRE, SRE-Unowned: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#10201849 (MoritzMuehlenhoff)
[09:20:50] (PS5) Ayounsi: redfish: add UEFI functions [software/spicerack] - https://gerrit.wikimedia.org/r/1077661
[09:20:55] (CR) Ayounsi: redfish: add UEFI functions (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[09:22:29] SRE, SRE-Unowned: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#10201853 (MoritzMuehlenhoff) Open→Declined The legacy irc.wikimedia.org setup is being phased out in favour of https://wikitech.wikimedia.org/wiki/Ircstream
[09:22:33] SRE, SRE-Unowned: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054#10201855 (MoritzMuehlenhoff)
[09:23:07] SRE, SRE-Unowned: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054#10201857 (MoritzMuehlenhoff) Open→Declined The legacy irc.wikimedia.org setup is being phased out in favour of https://wikitech.wikimedia.org/wiki/Ircstream
[09:23:36] (PS19) Klausman: hiera/modules: Add ML Lab machine roles and config [puppet] - https://gerrit.wikimedia.org/r/1077710
[09:24:13] (PS20) Klausman: hiera/modules: Add ML Lab machine roles and config [puppet] - https://gerrit.wikimedia.org/r/1077710
[09:25:57] (PS1) Jelto: wikidata-query-gui: fix port already in use issue [deployment-charts] - https://gerrit.wikimedia.org/r/1077919 (https://phabricator.wikimedia.org/T350793)
[09:32:35] (CR) CI reject: [V:-1] redfish: add UEFI functions [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[09:35:11] !log upload ircstream 0.13.0+wmf12u1 to apt.wikimedia.org T376014
[09:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:15] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014
[09:36:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:42:38] (PS1) Brouberol: Provision capabilities for all Ceph mds services [puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402)
[09:43:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database shnwikinews (T375432)
[09:43:55] T375432: Prepare and check storage layer for shnwikinews - https://phabricator.wikimedia.org/T375432
[09:45:25]
RESOLVED: SystemdUnitFailed: prometheus-puppet-agent-stats.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:47] (CR) Elukey: "Left some notes, lemme know what you think about them! Also please add tests :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[09:54:16] (PS2) Brouberol: Provision capabilities for all Ceph mds services [puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402)
[09:55:28] (CR) Brouberol: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4221/console" [puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[09:57:10] (PS1) Brouberol: Provision dummy keys for cephosd mds servers [labs/private] - https://gerrit.wikimedia.org/r/1077926
[09:58:29] (PS2) Brouberol: Provision dummy keys for cephosd mds servers [labs/private] - https://gerrit.wikimedia.org/r/1077926 (https://phabricator.wikimedia.org/T376402)
[10:00:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[10:01:39] (CR) Btullis: [C:+1] Provision dummy keys for cephosd mds servers [labs/private] - https://gerrit.wikimedia.org/r/1077926 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:03:31] (CR) Brouberol: [C:+2] Provision dummy keys for cephosd mds servers [labs/private] - https://gerrit.wikimedia.org/r/1077926 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:03:33] (CR) Brouberol: [V:+2 C:+2] Provision dummy keys for cephosd mds servers
[labs/private] - https://gerrit.wikimedia.org/r/1077926 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:07:30] !log upload ircstream 0.13.0+sse12u1 to apt.wikimedia.org bookworm/ircstream-sse component (seperate build using the experimental eventstream feature branch of ircstream) T376014
[10:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:32] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014
[10:09:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1004.wikimedia.org
[10:12:59] (PS1) Giuseppe Lavagetto: git::replicated_local_repo: further changes to post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1077931
[10:13:11] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:13:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1004.wikimedia.org
[10:13:18] (CR) CI reject: [V:-1] git::replicated_local_repo: further changes to post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1077931 (owner: Giuseppe Lavagetto)
[10:15:15] (CR) Btullis: [C:+1] "Looks good, thanks."
[puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:15:26] (CR) Brouberol: [V:+1 C:+2] Provision capabilities for all Ceph mds services [puppet] - https://gerrit.wikimedia.org/r/1077923 (https://phabricator.wikimedia.org/T376402) (owner: Brouberol)
[10:16:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2004.wikimedia.org
[10:19:45] (PS2) Giuseppe Lavagetto: git::replicated_local_repo: further changes to post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1077931
[10:19:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2004.wikimedia.org
[10:22:33] (CR) Giuseppe Lavagetto: [C:+2] git::replicated_local_repo: further changes to post-commit hook [puppet] - https://gerrit.wikimedia.org/r/1077931 (owner: Giuseppe Lavagetto)
[10:30:04] (PS1) Muehlenhoff: ircstream: No longer install python3-aiohttp-sse-client [puppet] - https://gerrit.wikimedia.org/r/1077932 (https://phabricator.wikimedia.org/T376014)
[10:31:49] (PS1) Elukey: role::docker_registry_ha::registry: add nginx monitoring [puppet] - https://gerrit.wikimedia.org/r/1077933 (https://phabricator.wikimedia.org/T376285)
[10:32:51] (PS6) Ayounsi: redfish: add UEFI functions [software/spicerack] - https://gerrit.wikimedia.org/r/1077661
[10:33:16] (CR) Ayounsi: "Don't worry this won't get merged without tests. But I prefer to write them once we all agree on the actual code. FYI, this is still WIP."
[software/spicerack] - https://gerrit.wikimedia.org/r/1077661 (owner: Ayounsi)
[10:33:26] (PS2) Elukey: role::docker_registry_ha::registry: add nginx monitoring [puppet] - https://gerrit.wikimedia.org/r/1077933 (https://phabricator.wikimedia.org/T376285)
[10:35:45] (CR) Elukey: "Still WIP, need more work :)" [puppet] - https://gerrit.wikimedia.org/r/1077933 (https://phabricator.wikimedia.org/T376285) (owner: Elukey)
[10:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:39:41] (CR) Dreamy Jazz: scap: Add a deprecation warning to classic mwscript (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[10:39:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10202058 (phaultfinder)
[10:40:22] (CR) Ayounsi: WIP: add efi support to partman (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077740 (owner: JHathaway)
[10:40:34] (CR) Dreamy Jazz: scap: Add a deprecation warning to classic mwscript (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[10:46:24] ops-eqiad, SRE, DC-Ops, Wikimedia-Portals, Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10202079 (simon04) a:Punith.nyk→None @TheDJ in https://gerrit.wikimedia.org/r/910766 >...
[10:47:43] (PS5) Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716)
[10:48:06] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[10:52:30] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: dns/netbox: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462 (aborrero) NEW
[10:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:56:55] FIRING: [3x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:57:08] (PS6) Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716)
[10:57:15] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: Arturo Borrero Gonzalez)
[10:57:23] (CR) Elukey: [C:+1] "Upstream is definitely too active for our pace :D" [puppet] - https://gerrit.wikimedia.org/r/1077932 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[10:58:17] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10202117 (elukey)
Revamped a bit https://grafana-rw.wikimedia.org/d/StcefURWz/docker-registry, and I noticed that Swift P50 late...
[10:59:11] SRE, iPoid-Service, Patch-For-Review: Increase in connection timeouts on ipoid-production - https://phabricator.wikimedia.org/T375006#10202112 (jijiki) I have looked into this a little bit, but sadly I have not found yet anythat could be causing this. @TK-999 let us know if your patch improves the issue
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241004T0700)
[11:00:05] eoghan, jelto, arnoldokoth, and mutante: Your horoscope predicts another GitLab version upgrades deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241004T1100).
[11:04:05] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: dns/netbox: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10202145 (cmooney) This patch covers the delegation for the openstack-managed ranges, I think it's correct? https://g...
[11:06:20] (PS6) Brouberol: Deploy a Ceph MDS service on each cephosd server [puppet] - https://gerrit.wikimedia.org/r/1077936 (https://phabricator.wikimedia.org/T376404)
[11:06:55] FIRING: [4x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:55] FIRING: [4x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:00] (CR) Muehlenhoff: "This might work.
Can't provide a more authoritative answer, Partman has too many possible unforseen side affects. Instead we could also me" [puppet] - https://gerrit.wikimedia.org/r/1077740 (owner: JHathaway)
[11:14:20] (CR) Muehlenhoff: [C:+2] ircstream: No longer install python3-aiohttp-sse-client [puppet] - https://gerrit.wikimedia.org/r/1077932 (https://phabricator.wikimedia.org/T376014) (owner: Muehlenhoff)
[11:14:24] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10202157 (aborrero)
[11:15:33] (CR) Btullis: Change an-worker117[67] to use reuse partman recipe. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[11:16:06] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10202161 (cmooney) To be more clear, you need to make sure these two zones are working on the openstack authdns: ` 0.0.0.0.0....
[11:18:39] (CR) Stevemunene: Change an-worker117[67] to use reuse partman recipe. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[11:23:54] (CR) Btullis: Change an-worker117[67] to use reuse partman recipe.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[11:25:35] (Abandoned) Samtar: Initial configuration for amicalwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/892575 (https://phabricator.wikimedia.org/T330390) (owner: Samtar)
[11:25:41] (Abandoned) Samtar: Add Apache configuration for amical.wikimedia [puppet] - https://gerrit.wikimedia.org/r/892555 (https://phabricator.wikimedia.org/T330390) (owner: Samtar)
[11:46:46] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@9096f1b] (releasing): (no justification provided)
[11:47:27] (PS1) Slyngshede: C:ircstream move SSE hosts to internal endpoint. [puppet] - https://gerrit.wikimedia.org/r/1077941 (https://phabricator.wikimedia.org/T376014)
[11:47:31] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@9096f1b] (releasing): (no justification provided) (duration: 00m 47s)
[11:52:31] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1077941 (https://phabricator.wikimedia.org/T376014) (owner: Slyngshede)
[11:54:48] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10202237 (aborrero) before and after merging the tofu-infra patch above: `lang=shell-session arturo@nostromo:~ $ dig SOA 0...
[11:55:29] (CR) Slyngshede: [C:+2] C:ircstream move SSE hosts to internal endpoint. [puppet] - https://gerrit.wikimedia.org/r/1077941 (https://phabricator.wikimedia.org/T376014) (owner: Slyngshede)
[11:55:42] (CR) Arturo Borrero Gonzalez: [C:+1] "per https://phabricator.wikimedia.org/T376462#10202237 we may proceed now with this change."
[dns] - https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: Cathal Mooney)
[11:59:46] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@9096f1b] (releasing): (no justification provided)
[12:00:14] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10202238 (aborrero) Open→In progress p:Triage→Medium
[12:00:57] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@9096f1b] (releasing): (no justification provided) (duration: 01m 13s)
[12:14:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1077943
[12:14:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1077943 (owner: TrainBranchBot)
[12:15:17] (CR) Btullis: [C:+1] "Looks good to me. Let's try it."
[puppet] - 10https://gerrit.wikimedia.org/r/1077936 (https://phabricator.wikimedia.org/T376404) (owner: 10Brouberol) [12:18:32] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [12:27:57] (03CR) 10Brouberol: [V:03+1 C:03+2] Deploy a Ceph MDS service on each cephosd server [puppet] - 10https://gerrit.wikimedia.org/r/1077936 (https://phabricator.wikimedia.org/T376404) (owner: 10Brouberol) [12:38:48] (03PS1) 10Brouberol: ceoh: specify the keyring path for the cephosd MDS servers [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) [12:39:38] (03PS2) 10Brouberol: ceph: specify the keyring path for the cephosd MDS servers [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) [12:41:37] (03PS3) 10Brouberol: ceph: specify the keyring path for the cephosd MDS servers [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) [12:42:30] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4226/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) (owner: 10Brouberol) [12:43:03] (03CR) 10Btullis: [C:03+1] ceph: specify the keyring path for the cephosd MDS servers [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) (owner: 10Brouberol) [12:43:18] (03CR) 10Brouberol: [V:03+1 C:03+2] ceph: specify the keyring path for the cephosd MDS servers [puppet] - 10https://gerrit.wikimedia.org/r/1077944 (https://phabricator.wikimedia.org/T376404) (owner: 10Brouberol) [12:47:15] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1077943 (owner: 
10TrainBranchBot) [12:55:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:58:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10202329 (10kamila) 05Open→03In progress Thanks for confirming @seanleong-WMDE, I will finalise this as soon as the group... [13:01:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10202351 (10kamila) 05Open→03In progress a:03kamila [13:03:48] (03CR) 10Btullis: [C:03+1] "Feel free to try it out. It might work." [puppet] - 10https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:09:20] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10202360 (10MBywater-WMF) lisa@wikimedia.org is an alias of lgruwell@wikimedia.org [13:09:35] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10202361 (10kamila) [13:11:51] (03CR) 10Ayounsi: "That looks fine to me!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [13:15:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:19:40] !log ayounsi@cumin1002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet [13:26:10] (03CR) 10Ayounsi: WIP: Don't send the dhcp file to the debian installer (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [13:27:16] (03PS1) 10Ladsgroup: tables-catalog: Add wikibase repo tables [puppet] - 10https://gerrit.wikimedia.org/r/1077948 (https://phabricator.wikimedia.org/T363581) [13:30:50] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add wikibase repo tables [puppet] - 10https://gerrit.wikimedia.org/r/1077948 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:32:51] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [13:33:56] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.categories-reload (exit_code=97) reloading categories to wdqs-categories1001.eqiad.wmnet [13:36:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:54] (03PS1) 10Kamila Součková: data.yaml: add user for wrai [puppet] - 10https://gerrit.wikimedia.org/r/1077952 (https://phabricator.wikimedia.org/T376298) [13:50:42] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10202535 (10Papaul) [13:52:58] (03CR) 10Elukey: "+1 I agree, this should probably work but I'll keep it as opt-in as 
first step." [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [13:54:59] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs-categories1001.eqiad.wmnet [13:58:12] (03CR) 10Bking: [C:03+2] dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [13:59:11] (03Merged) 10jenkins-bot: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) (owner: 10Bking) [14:00:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:12:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1077952 (https://phabricator.wikimedia.org/T376298) (owner: 10Kamila Součková) [14:20:13] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10202600 (10nisrael) Ok good to know. I will set up a call with Lisa next Monday and we can get an example of one of the mails. She has access to that inbox. 
[14:28:37] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@4b69f50]: add category to commons impact metrics allowlist
[14:29:28] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@4b69f50]: add category to commons impact metrics allowlist (duration: 01m 48s)
[14:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:38:39] (PS1) Herron: add links to SLOs migrated to pyrra [grafana-grizzly] - https://gerrit.wikimedia.org/r/1077966
[14:39:58] (CR) Kamila Součková: [C:+2] data.yaml: add user for wrai [puppet] - https://gerrit.wikimedia.org/r/1077952 (https://phabricator.wikimedia.org/T376298) (owner: Kamila Součková)
[14:41:18] (CR) Samtar: [C:+1] "lgtm, can be scheduled for a backport next week :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) (owner: MusikAnimal)
[14:43:43] SRE, Infrastructure-Foundations, Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10202667 (MBywater-WMF) Thanks, @nisrael!
[14:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10202668 (phaultfinder)
[14:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:54:09] (PS2) Herron: add links to SLOs migrated to pyrra [grafana-grizzly] - https://gerrit.wikimedia.org/r/1077966
[15:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:14] (CR) Elukey: "I'd need to better understand the use case, sorry for the silly question - could we force/specify the filename in the cookbook rather than" [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[15:11:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:58] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for wrai - https://phabricator.wikimedia.org/T376298#10202736 (kamila) In progress→Resolved Done, let me know in case there are any problems.
[15:15:59] (CR) JHathaway: "When provisioning a server we have at least two dhcp sequences:" [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[15:20:06] (PS7) Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716)
[15:23:06] (CR) JHathaway: WIP: Don't send the dhcp file to the debian installer (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[16:00:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.categories-reload (exit_code=99) reloading categories to wdqs-categories1001.eqiad.wmnet
[16:13:27] (CR) Cwhite: [C:+2] logstash: put logging-hd100[4-5] in service [puppet] - https://gerrit.wikimedia.org/r/1077499 (https://phabricator.wikimedia.org/T375447) (owner: Cwhite)
[16:21:20] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest2001.codfw.wmnet
[16:27:04] (PS1) Tiziano Fogli: kafka: port mirror maker alerts to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153)
[16:28:15] (CR) CI reject: [V:-1] kafka: port mirror maker alerts to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli)
[16:53:49] (CR) Elukey: "I see thanks for the explanation. So IIUC we need something dynamic that returns a filename or the other depending on what the DHCP client" [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[16:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[17:00:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:02:32] (CR) JHathaway: "> I see thanks for the explanation. So IIUC we need something dynamic that returns a filename or the other depending on what the DHCP clie" [software/spicerack] - https://gerrit.wikimedia.org/r/1077804 (owner: JHathaway)
[17:04:17] (PS2) Tiziano Fogli: kafka: port mirror maker alerts to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153)
[17:04:59] SRE, SRE-Unowned, Infrastructure-Foundations, Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10203183 (elukey) Last day of hackathon (5th day): * irc[12]004 has been updated with...
[17:20:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:36:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:44:40] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376445#10203311 (Dzahn)
[17:44:52] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376445#10203315 (Dzahn) →Duplicate dup:T376094
[17:46:03] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10203313 (Dzahn)
[17:47:13] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10203317 (Dzahn) This is the third ticket now that was created for the same issue. There should probably be something to prevent that from happening. One ticket until it's resolved woul...
[18:00:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[18:01:29] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1077781/4228/gerrit1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[18:02:29] (PS1) RLazarus: deployment_server: Pass $HOME when mwscript_k8s shells out to kubectl [puppet] - https://gerrit.wikimedia.org/r/1077992 (https://phabricator.wikimedia.org/T341553)
[18:09:05] (CR) Dzahn: [V:+1 C:+2] "noop on all 3 machines in prod" [puppet] - https://gerrit.wikimedia.org/r/1077781 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[18:26:20] SRE-OnFire, Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10203441 (Eevans)
[18:29:42] SRE-OnFire, Incident Tooling: Ensure that Phabricator tasks are created with the correct default visibility and priority - https://phabricator.wikimedia.org/T376500 (Eevans) NEW
[18:30:56] SRE-OnFire, Incident Tooling: Corto: ensure Phabricator tasks are created with correct default visibility & priority - https://phabricator.wikimedia.org/T376500#10203455 (Eevans)
[18:33:00] SRE-OnFire, Incident Tooling: Corto: remove unused context.Context arguments - https://phabricator.wikimedia.org/T376501 (Eevans) NEW
[18:33:10] (CR) Scott French: [C:+1] "Wow, that is some wild behavior on kubectl's part, IMO. This seems like a fine solution, so LGTM! (and there doesn't appear to be any obvi" [puppet] - https://gerrit.wikimedia.org/r/1077992 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[18:33:52] SRE-OnFire, Incident Tooling: Corto internal incident response workflow automation (MVP) - https://phabricator.wikimedia.org/T356790#10203472 (Eevans)
[18:37:53] (CR) C. Scott Ananian: scandium is being replaced by parsoidtest1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: Arlolra)
[18:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:46:52] (CR) Arlolra: scandium is being replaced by parsoidtest1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: Arlolra)
[18:49:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10203510 (phaultfinder)
[18:53:03] (CR) RLazarus: [C:+2] deployment_server: Give mwscript-k8s --verbose more granular options [puppet] - https://gerrit.wikimedia.org/r/1077475 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[18:53:23] (CR) RLazarus: [C:+2] deployment_server: Pass $HOME when mwscript_k8s shells out to kubectl [puppet] - https://gerrit.wikimedia.org/r/1077992 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[18:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:55:26] (PS3) Dzahn: gerrit: include gerrit profile in insetup::gerrit for testing [puppet] - https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804)
[18:55:27] (CR) Dzahn: "should now be ok after Ia9639f1ef8f758f834d52e made it possible without IP conflict" [puppet] - https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[18:55:57] and q2 tasks/orders are being escalated to sub teams today, wooo
[18:56:03] bleh, wrong channel
[18:56:28] (PS4) Dzahn: gerrit: include gerrit profile in insetup::gerrit for testing [puppet] - https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804)
[19:02:20] SRE, Infrastructure-Foundations, serviceops, Datacenter-Switchover, Patch-For-Review: sre.discovery.datacenter should support switching the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364#10203543 (Kappakayala) @Clement_Goubert: are we done with thi...
[19:11:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:34:16] (PS1) RLazarus: scap: Exclude mysql.php from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078003 (https://phabricator.wikimedia.org/T341553)
[19:36:21] (CR) RLazarus: [C:+2] scap: Add a deprecation warning to classic mwscript (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[19:46:09] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1074477/4230/gerrit2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[19:48:38] (PS2) RLazarus: scap: Exclude mysql.php from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078003 (https://phabricator.wikimedia.org/T341553)
[19:58:30] ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#10203709 (Eevans) Open→Resolved a:Jclark-ctr→VRiley-WMF I waited a long time Just To Be Sure™, but there have been no subsequent failures; @VRiley-WMF you have the magic touch!
[20:11:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:17:32] (CR) Scott French: [C:+1] scap: Exclude mysql.php from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078003 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[20:18:34] (CR) RLazarus: [C:+2] scap: Exclude mysql.php from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1078003 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[20:22:08] (CR) Dreamy Jazz: scap: Add a deprecation warning to classic mwscript (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1077450 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[20:30:21] (PS2) Scott French: mw-debug: add initial "next" release (attempt 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604)
[20:30:22] (PS1) Scott French: mw-debug: remove temporary release value override [deployment-charts] - https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604)
[20:50:19] (CR) Dzahn: "Tbh, I don't know. This seems kind of controversial, highly unusual and the ticket is a couple years old." [puppet] - https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: BCornwall)
[20:57:55] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376511 (phaultfinder) NEW
[20:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[21:05:36] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:25:36] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:25:59] (PS4) Scott French: sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962)
[21:25:59] (PS3) Scott French: sre.discovery.datacenter: exclude swift-https [cookbooks] - https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962)
[21:32:16] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10203997 (Jclark-ctr) a:Jclark-ctr→Eevans
[21:36:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:00:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[22:16:54] SRE, Infrastructure-Foundations, netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10204075 (Papaul)
[22:20:29] (CR) C. Scott Ananian: [C:+1] scandium is being replaced by parsoidtest1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/1077800 (https://phabricator.wikimedia.org/T363402) (owner: Arlolra)
[22:29:00] (PS2) Jdlrobson: DONOTMERGE: Remove legacy UI actions tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: Kimberly Sarabia)
[22:30:59] SRE, Domains, Traffic, Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10204094 (BCornwall) Okay, the patch has approval and will be merged if/when we get the domain into our MarkMonitor account - otherwise automation (ncmonitor) will be unhappy and try to remove it...
[22:37:58] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:53:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10204163 (phaultfinder)
[22:55:29] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-Needs-Improvement, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10204155 (BCornwall) @Dzahn You had hesitations on the Gerrit CR...
[23:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1078015
[23:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1078015 (owner: TrainBranchBot)
[23:45:43] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-Needs-Improvement, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10204219 (Pppery) Although Alemannic already does that...
[23:49:24] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-Needs-Improvement, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10204217 (Pppery) ... but it's worth pointing out that bar.wikti...
[23:54:50] (PS6) JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519)
[23:58:30] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown