[00:00:48] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/935137
[00:00:50] (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/935137 (owner: Zabe)
[00:01:34] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/935137 (owner: Zabe)
[00:02:19] !log zabe@deploy1002 Started scap: update interwiki cache
[00:09:11] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 06m 51s)
[00:38:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935138
[00:38:51] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935138 (owner: TrainBranchBot)
[00:54:54] (PS1) Andrea Denisse: Move XHGui secrets to provision XHGui on performance.wikimedia.org Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935533 (https://phabricator.wikimedia.org/T340713)
[00:56:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935138 (owner: TrainBranchBot)
[01:28:00] (PS2) Andrea Denisse: Move XHGui secrets to provision XHGui on performance.wikimedia.org Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935533 (https://phabricator.wikimedia.org/T340713)
[01:28:26] (PS2) Daimona Eaytoy: beta: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - https://gerrit.wikimedia.org/r/935463 (https://phabricator.wikimedia.org/T320258)
[01:29:36] (PS3) Andrea Denisse: Move XHGui secrets to provision XHGui on performance.wikimedia.org Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935533 (https://phabricator.wikimedia.org/T340713)
[01:33:48] (PS1) Andrea Denisse: Move XHGui secrets to provision XHGui on
performance.wikimedia.org Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935534 (https://phabricator.wikimedia.org/T340713)
[01:34:16] (Abandoned) Andrea Denisse: Move XHGui secrets to provision XHGui on performance.wikimedia.org Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935533 (https://phabricator.wikimedia.org/T340713) (owner: Andrea Denisse)
[01:37:25] (PS2) Andrea Denisse: Add XHGui secrets to provision XHGui on the performance.wikimedia.org site Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935534 (https://phabricator.wikimedia.org/T340713)
[01:39:54] (CR) Krinkle: [C: +1] Add XHGui secrets to provision XHGui on the performance.wikimedia.org site Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935534 (https://phabricator.wikimedia.org/T340713) (owner: Andrea Denisse)
[01:42:11] (CR) Andrea Denisse: [C: +2] Add XHGui secrets to provision XHGui on the performance.wikimedia.org site Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935534 (https://phabricator.wikimedia.org/T340713) (owner: Andrea Denisse)
[01:42:14] (CR) Andrea Denisse: [V: +2 C: +2] Add XHGui secrets to provision XHGui on the performance.wikimedia.org site Bug: T340713 [labs/private] - https://gerrit.wikimedia.org/r/935534 (https://phabricator.wikimedia.org/T340713) (owner: Andrea Denisse)
[01:44:03] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935512 (owner: Krinkle)
[02:04:21] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:55:16] (PS1) David Martin: Add performer_pageview_id & performer_is_bot to wikifunctions.ui stream [mediawiki-config] - https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005)
[04:15:27] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:00:24] (PS3) Anzx: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105)
[05:01:21] (PS4) Anzx: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105)
[05:03:03] (PS5) Anzx: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105)
[05:11:12] (CR) Marostegui: [C: +1] mysql: Introduce sre.mysql.clone [cookbooks] -
https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: Ladsgroup)
[05:13:05] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T0600)
[06:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:03] (PS1) Marostegui: dbproxy1014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935664
[06:41:56] (CR) Marostegui: [C: +2] dbproxy1014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935664 (owner: Marostegui)
[06:54:28] (CR) Giuseppe Lavagetto: [C: +1] mw-on-k8s: Redirect 0.5% of all traffic to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/935466 (https://phabricator.wikimedia.org/T341078) (owner: Clément Goubert)
[06:54:41] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Documentation, Puppet (Puppet 7.0): Puppet7: Update documentation -
https://phabricator.wikimedia.org/T341095 (Aklapper)
[06:58:49] (PS1) Marostegui: dbproxy1015: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935665
[06:59:24] (CR) Marostegui: [C: +2] dbproxy1015: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935665 (owner: Marostegui)
[06:59:55] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[07:00:04] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T0700).
[07:00:04] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:05] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[07:06:30] (CR) Ladsgroup: [C: +2] mysql: Introduce sre.mysql.clone [cookbooks] - https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: Ladsgroup)
[07:09:18] (Merged) jenkins-bot: mysql: Introduce sre.mysql.clone [cookbooks] - https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) (owner: Ladsgroup)
[07:37:38] sigh, why it thinks I'm not on call?
[07:40:03] Amir1: too early
[07:40:12] your shift starts at 800
[07:40:20] aah, okay
[07:40:26] I thought it's at 7:00
[07:40:29] anyway
[07:40:58] SRE, ops-eqiad: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (Volans)
[07:42:54] Amir1: did you change it since yesterday? I guess it had the time that alex used for the past 2 days
[07:44:24] yeah, probably. I haven't changed it yet. Right now fighting with the VO login in my phone that decided to log me out right before my shift
[07:45:13] (and only works with google SSO on web meaning I need to do the okta 2fa dance on my phone)
[08:00:05] hashar and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T0800).
[08:00:58] o/
[08:01:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:03:19] (PS1) Marostegui: dbproxy1016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935671
[08:04:13] (PS1) Clément Goubert: mw-api-ext: Raise replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/935672 (https://phabricator.wikimedia.org/T341078)
[08:06:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:07:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet
[08:07:32] (CR) Giuseppe Lavagetto: [C: +1] mw-api-ext: Raise replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/935672 (https://phabricator.wikimedia.org/T341078) (owner: Clément Goubert)
[08:07:46] (CR) Clément Goubert: [C: +2] mw-api-ext: Raise replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/935672 (https://phabricator.wikimedia.org/T341078) (owner: Clément Goubert)
[08:08:30] (Merged) jenkins-bot: mw-api-ext: Raise replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/935672 (https://phabricator.wikimedia.org/T341078) (owner: Clément Goubert)
[08:09:27] (CR) Marostegui: [C: +2] dbproxy1016: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/935671 (owner: Marostegui)
[08:09:56] (CR) David Caro: cloudcumin: don't send logs to prod IRC (3 comments) [puppet] -
https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[08:10:09] SRE, Infrastructure-Foundations, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (SLyngshede-WMF)
[08:10:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[08:10:20] SRE, Infrastructure-Foundations, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (SLyngshede-WMF) p:Triage→Low
[08:10:31] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[08:10:42] RECOVERY - Check systemd state on puppetboard1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:12] RECOVERY - Check systemd state on puppetboard2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:31] SRE, ops-eqiad, Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (SLyngshede-WMF) p:Triage→Medium
[08:12:28] (CR) Michael Große: [C: +1] outreachwiki: Set wmgWikibaseSiteGroup [mediawiki-config] - https://gerrit.wikimedia.org/r/935455 (owner: Lucas Werkmeister (WMDE))
[08:12:46] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[08:13:01] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[08:13:05] (CR) Michael Große: [C: +1] foundationwiki: Enable WikibaseClient [mediawiki-config] - https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: Varnent)
[08:19:34] (CR) Alexandros Kosiaris: [C: +1] wikikube: Switch to new IPv6 service ip ranges [puppet] - https://gerrit.wikimedia.org/r/933100
(https://phabricator.wikimedia.org/T335285) (owner: JMeybohm)
[08:19:42] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): puppetdb; allow connections from puppetserver over ipv6 - https://phabricator.wikimedia.org/T340563 (jbond) Open→Invalid
[08:19:52] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond)
[08:20:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet
[08:21:40] PROBLEM - Host kubestagetcd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:01] (PS1) Clément Goubert: mw-on-k8s: Revert sending traffic to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/935673 (https://phabricator.wikimedia.org/T341078)
[08:22:40] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:46] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:06] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:12] (CR) JMeybohm: [C: +2] Revert "Revert "k8s: Configure the IPv6 service ip range for apiserver"" [puppet] - https://gerrit.wikimedia.org/r/933101 (owner: JMeybohm)
[08:23:15] (CR) JMeybohm: [C: +2] wikikube: Switch to new IPv6 service ip ranges [puppet] - https://gerrit.wikimedia.org/r/933100 (https://phabricator.wikimedia.org/T335285) (owner: JMeybohm)
[08:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:23:38] (CR) Clément Goubert: "Merge patch in case of emergency to stop redirecting 0.5% of global traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/935673
(https://phabricator.wikimedia.org/T341078) (owner: Clément Goubert)
[08:25:16] I am running the train
[08:25:28] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[08:25:30] RECOVERY - Host kubestagetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[08:25:40] (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935674 (https://phabricator.wikimedia.org/T340244)
[08:25:42] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[08:25:42] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935674 (https://phabricator.wikimedia.org/T340244) (owner: TrainBranchBot)
[08:25:56] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[08:26:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet
[08:26:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet
[08:26:25] (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/935674 (https://phabricator.wikimedia.org/T340244) (owner: TrainBranchBot)
[08:26:48] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:28:01] hashar: I'll wait for you to be done before flipping the mw-on-k8s switch then
[08:28:05] ping me?
[08:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:32:21] (CR) Btullis: [C: +2] Temporarily disable gobblin jobs on the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/935425 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:33:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet
[08:34:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:34:37] SRE, envoy, serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (JMeybohm) a:JMeybohm
[08:34:41] SRE, Traffic, envoy, serviceops: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (JMeybohm) a:JMeybohm
[08:34:51] (PS13) Fabfur: haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983)
[08:34:53] SRE, envoy, serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (JMeybohm) a:JMeybohm
[08:35:11] (CR) CI reject: [V: -1] haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:36:12] (PS14) Fabfur: haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935095
(https://phabricator.wikimedia.org/T340983)
[08:38:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-netflow.timer,gobblin-webrequest.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:25] (PS2) Btullis: Temporarily disable the spark jobs that are running on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/935426 (https://phabricator.wikimedia.org/T332765)
[08:39:27] (PS2) Btullis: Upgrade the spark shuffler service from version 2 to version 3 [puppet] - https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765)
[08:39:56] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:40:31] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:40:38] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:41:14] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42247/console" [puppet] - https://gerrit.wikimedia.org/r/935426 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:42:01] (CR) Btullis: [V: +1 C: +2] Temporarily disable the spark jobs that are running on an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/935426 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:42:11] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42248/console" [puppet] - https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:42:32] (CR) Fabfur: haproxy: support different actions for tls and http frontend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:43:55]
!log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.16 refs T340244
[08:44:04] T340244: 1.41.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T340244
[08:45:45] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:45:52] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:47:27] (CR) Btullis: [V: +1 C: -1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42251/console" [puppet] - https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:49:08] (PS1) Effie Mouzeli: ipoid: fix app key in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/935675
[08:49:16] (CR) CI reject: [V: -1] ipoid: fix app key in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/935675 (owner: Effie Mouzeli)
[08:49:35] (PS2) Effie Mouzeli: ipoid: fix app key in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/935675
[08:50:14] ah
[08:50:16] scap failed
[08:50:18] (PS1) AikoChou: ml-services: update readability docker image [deployment-charts] - https://gerrit.wikimedia.org/r/935676 (https://phabricator.wikimedia.org/T334182)
[08:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST endpointslices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:50:55] (CR) Effie Mouzeli: [C: +2] ipoid: fix app key in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/935675 (owner: Effie Mouzeli)
[08:51:14] (PS2) AikoChou: ml-services: update readability docker image [deployment-charts] - https://gerrit.wikimedia.org/r/935676
(https://phabricator.wikimedia.org/T334182)
[08:51:35] (Merged) jenkins-bot: ipoid: fix app key in values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/935675 (owner: Effie Mouzeli)
[08:52:51] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:52:54] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[08:53:01] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:53:38] (PS2) Stevemunene: Create spark3 local directory [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765)
[08:53:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet
[08:54:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet
[08:55:34] (KubernetesAPILatency) resolved: (16) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:56:28] (CR) Btullis: [V: +1 C: +2] Upgrade the spark shuffler service from version 2 to version 3 [puppet] - https://gerrit.wikimedia.org/r/935427 (https://phabricator.wikimedia.org/T332765) (owner: Btullis)
[08:58:25] (CR) Stevemunene: [V: +1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42252/console" [puppet] - https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: Stevemunene)
[08:59:28] (PS1) JMeybohm: envoy: Refactor max_requests_per_connection [puppet] - https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124)
[08:59:33] (PS1) JMeybohm: Add mesh.configuration 1.3.2 [deployment-charts] -
https://gerrit.wikimedia.org/r/935679 (https://phabricator.wikimedia.org/T300324)
[08:59:35] (PS1) JMeybohm: mesh.configuration: Refactor max_requests_per_connection [deployment-charts] - https://gerrit.wikimedia.org/r/935680 (https://phabricator.wikimedia.org/T304124)
[08:59:54] (PS1) Effie Mouzeli: ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935681
[09:00:49] (KubernetesAPILatency) firing: (44) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:00:56] SRE, Bitu, Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (SLyngshede-WMF) Open→In progress
[09:00:59] SRE, Bitu, Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF)
[09:01:04] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:01:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet
[09:01:10] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:01:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet
[09:01:25] (CR) Effie Mouzeli: [C: +2] ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935681 (owner: Effie Mouzeli)
[09:02:04] (KubernetesAPILatency) firing: (44) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:02:16] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:02:22] !log gmodena@deploy1002 helmfile [codfw] DONE
helmfile.d/services/mw-page-content-change-enrich: apply
[09:02:27] (Merged) jenkins-bot: ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935681 (owner: Effie Mouzeli)
[09:03:00] (PS1) Effie Mouzeli: ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935682
[09:03:51] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[09:04:24] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:04:31] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:05:10] (PS1) JMeybohm: mesh.configuration: Remove tls_minimum_protocol_version [deployment-charts] - https://gerrit.wikimedia.org/r/935684 (https://phabricator.wikimedia.org/T337453)
[09:05:12] (PS1) JMeybohm: envoy: Remove tls_minimum_protocol_version [puppet] - https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453)
[09:05:50] (KubernetesAPILatency) resolved: (44) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:06] (CR) Effie Mouzeli: [C: +2] ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935682 (owner: Effie Mouzeli)
[09:06:27] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:06:33] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:06:48] (Merged) jenkins-bot: ipoid: update resources [deployment-charts] - https://gerrit.wikimedia.org/r/935682 (owner: Effie Mouzeli)
[09:07:13] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[09:07:26] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:07:32] !log
gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:07:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet
[09:07:48] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[09:08:19] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:22] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:23] (PS15) Fabfur: haproxy: support different actions for tls and http frontend [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983)
[09:08:38] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:08:41] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:09:09] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:09:16] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:10:13] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:10:18] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42254/console" [puppet] - https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:10:20] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:10:56] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:11:03] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:12:33] (PS1) Clément Goubert: admin_ng: Raise resource limits for mw-api-ext
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935685 [09:14:36] (03PS2) 10Clément Goubert: admin_ng: Raise resource limits for mw-api-ext [deployment-charts] - 10https://gerrit.wikimedia.org/r/935685 (https://phabricator.wikimedia.org/T341114) [09:15:44] (03CR) 10Fabfur: [V: 03+1] haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:15:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1014.eqiad.wmnet [09:17:32] (03PS4) 10Jelto: gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:17:34] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Raise resource limits for mw-api-ext [deployment-charts] - 10https://gerrit.wikimedia.org/r/935685 (https://phabricator.wikimedia.org/T341114) (owner: 10Clément Goubert) [09:17:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow1002.eqiad.wmnet to drbd [09:18:43] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:18:46] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:19:51] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:19:51] (03Merged) 10jenkins-bot: admin_ng: Raise resource limits for mw-api-ext [deployment-charts] - 10https://gerrit.wikimedia.org/r/935685 (https://phabricator.wikimedia.org/T341114) (owner: 10Clément Goubert) [09:19:57] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:20:05] (03PS1) 10Giuseppe Lavagetto: docker::builder: allow using bookworm as a base image [puppet] - 10https://gerrit.wikimedia.org/r/935686 
(https://phabricator.wikimedia.org/T341115) [09:20:33] Any issue with m5 database access over certain region? [09:20:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:21:24] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:21:30] (03PS1) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [09:21:31] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:21:58] (03CR) 10Jbond: [C: 03+1] gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:21:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:22:05] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:22:17] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:22:23] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:22:39] (03PS1) 10Btullis: Revert "Temporarily disable the spark jobs that are running on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/935491 (https://phabricator.wikimedia.org/T332765) [09:22:53] (03CR) 10Klausman: [C: 03+1] ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935676 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [09:23:29] (03CR) 10Btullis: [C: 03+2] Revert "Temporarily disable the spark jobs that are running on an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/935491 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [09:23:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:23:46] (03CR) 10Vgutierrez: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [09:24:06] !log cgoubert@deploy1002 Started scap: (no justification provided) [09:24:29] !log redeploy mw-on-k8s following quota update - T341114 [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:37] T341114: MediaWiki deployment to kubernetes fails on group1 promotion - https://phabricator.wikimedia.org/T341114 [09:24:57] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:25:00] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:26:26] !log cgoubert@deploy1002 Finished scap: (no justification provided) (duration: 02m 19s) [09:27:09] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:27:16] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:27:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:27:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow1002.eqiad.wmnet to drbd [09:28:01] PROBLEM - Host netflow1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:24] kart_: not to my knowledge [09:28:43] RECOVERY - Host netflow1002 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [09:29:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [09:30:48] Amir1: https://phabricator.wikimedia.org/T341117 filed this. 
[09:31:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [09:31:44] !log Sending 0.5% of global traffic to mw-on-k8s - T341078 [09:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:47] T341078: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 [09:32:06] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect 0.5% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/935466 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert) [09:32:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:32:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:33:04] (03PS2) 10Hashar: ci: enabling docker requires the docker-ce package [puppet] - 10https://gerrit.wikimedia.org/r/935471 (https://phabricator.wikimedia.org/T341051) [09:33:25] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [09:34:05] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:07] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:21] (03PS1) 10Btullis: Configure spark3 defaults to use the new fetch protocol [puppet] - 
10https://gerrit.wikimedia.org/r/935690 (https://phabricator.wikimedia.org/T332765) [09:34:43] kart_: Amir1: I strongly suspect it's the same thing we had with linkrecommandation and toolhub, so kubernetes network policies blocking traffic to new set of dbproxies [09:34:45] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [09:35:05] yeah, it's probably a k8s networking issue [09:35:31] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [09:35:33] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:35:37] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:35:53] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42255/console" [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:36:01] And, it seems k8s worker restarting around hitting this error. 
[09:36:09] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42256/console" [puppet] - 10https://gerrit.wikimedia.org/r/935512 (owner: 10Krinkle) [09:36:12] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:36:14] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:36:31] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [09:36:40] (03PS1) 10Btullis: Revert "Temporarily disable gobblin jobs on the analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/935492 (https://phabricator.wikimedia.org/T332765) [09:36:48] taavi: Can you give phab task links to those issue if already there/resolved? [09:37:01] !log running puppet on 'A:cp-text and P:trafficserver::backend' - T341078 [09:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:04] T341078: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 [09:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:37:28] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM! 
Timo, let me know when you are online and we can merge" [puppet] - 10https://gerrit.wikimedia.org/r/935512 (owner: 10Krinkle) [09:37:30] (03CR) 10Btullis: [C: 03+2] Revert "Temporarily disable gobblin jobs on the analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/935492 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [09:38:27] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:38:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [09:38:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [09:39:22] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:39:25] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:40:12] (03PS1) 10Marostegui: wmnet: Fail back to dbproxy1017 for m5 [dns] - 10https://gerrit.wikimedia.org/r/935692 [09:40:37] (03CR) 10Arturo Borrero Gonzalez: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [09:40:49] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw - aborrero@cumin1001" [09:40:50] (03PS1) 10Giuseppe Lavagetto: Add bookworm to the local build configurations [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935693 (https://phabricator.wikimedia.org/T341115) [09:40:52] (03PS1) 10Giuseppe Lavagetto: images: convert use of seed_image into use of image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935694 (https://phabricator.wikimedia.org/T341115) [09:41:00] (03PS1) 10Giuseppe Lavagetto: istio: convert use of seed_image into use of image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935695 
(https://phabricator.wikimedia.org/T341115) [09:41:04] (03PS1) 10Giuseppe Lavagetto: cert-manager: convert use of seed_image to image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935696 (https://phabricator.wikimedia.org/T341115) [09:41:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:41:10] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:41:10] (03CR) 10Volans: "Couple of questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [09:41:16] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:41:22] (03CR) 10Marostegui: [C: 03+2] wmnet: Fail back to dbproxy1017 for m5 [dns] - 10https://gerrit.wikimedia.org/r/935692 (owner: 10Marostegui) [09:41:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw - aborrero@cumin1001" [09:41:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:41:43] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: sync all configured providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [09:43:53] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:45:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [09:46:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet [09:47:30] (03PS1) 10Jbond: 
puppetdb::app: fix docs [puppet] - 10https://gerrit.wikimedia.org/r/935698 [09:47:46] (03CR) 10Arturo Borrero Gonzalez: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [09:47:56] (03PS2) 10Arturo Borrero Gonzalez: private.eqiad.wikimedia.cloud: introduce support for new zone [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) [09:47:57] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw - aborrero@cumin1001" [09:48:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw - aborrero@cumin1001" [09:48:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [09:52:55] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Urbanecm) @SLyngshede-WMF I disagree with the "Low" priority; there is a //lot of// unsolicited private messages going, which is annoying even with +g set (a... [09:53:21] (03PS1) 10Gmodena: mw-page-content-change-enrich revert docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) [09:53:28] (03CR) 10CI reject: [V: 04-1] mw-page-content-change-enrich revert docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) (owner: 10Gmodena) [09:54:06] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Urbanecm) Actually... 
I just figured I can do `/ignore *@anonymous.user`, which should sufficiently resolve this on the client side. But still, this represen... [09:54:40] 10SRE, 10Traffic, 10envoy, 10serviceops: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) `max(sum by (instance) (envoy_http_downstream_cx_active))` over the last 30 days tops out at ~... [09:55:45] (03PS2) 10Gmodena: mw-page-content-change-enrich revert docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) [09:56:27] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 7701 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [09:57:51] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2003 is OK: HTTP OK: HTTP/1.1 200 OK - 9053 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [09:58:00] (03CR) 10Btullis: [C: 03+1] Create spark3 local directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: 10Stevemunene) [09:58:34] (03PS2) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [09:58:41] (03CR) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [09:59:35] (03CR) 10Jbond: [C: 03+2] puppetdb::app: fix docs [puppet] - 10https://gerrit.wikimedia.org/r/935698 (owner: 10Jbond) [10:00:05] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10stwalkerster) I'm hoping to look tonight into whether I have the skills to sensibly patch-out PMs entirely, but I don't want to claim the task until I've got... [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T1000) [10:05:13] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [10:05:38] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [10:06:15] (03CR) 10Elukey: [C: 03+1] ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935676 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [10:07:17] (03PS1) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T300324) [10:10:10] (03CR) 10Joal: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/935690 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [10:12:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet [10:12:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [10:13:03] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10BTullis) >>! In T300324#8988266, @JMeybohm wrote: > ... 
as datahub (cc @BTullis ) which I did not deploy because it has a huge diff I'm not able to reason about.... [10:14:21] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org [10:15:07] PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:49] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10SLyngshede-WMF) @MoritzMuehlenhoff did you rebuild the irc-ratbox deb for the Bullseye hosts? [10:17:09] (03PS1) 10Kosta Harlan: common/gitlab_runner: Allow mariadb:* images for allowed_docker_services [puppet] - 10https://gerrit.wikimedia.org/r/935703 (https://phabricator.wikimedia.org/T339352) [10:18:07] (03CR) 10Btullis: [C: 03+2] Configure spark3 defaults to use the new fetch protocol [puppet] - 10https://gerrit.wikimedia.org/r/935690 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [10:19:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [10:19:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet [10:19:56] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10taavi) >>! In T341097#8989998, @stwalkerster wrote: > I also don't know how this is built/packaged/etc - I see there's a standalone patch file in the repo wh... 
[10:20:17] (03PS2) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T300324) [10:20:41] RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [10:22:25] !log restore US business hours escalation - T340763 [10:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:28] T340763: Adjusting On-Call Escalation Policies in Splunk for Upcoming 2023 July 4th - https://phabricator.wikimedia.org/T340763 [10:24:01] (03PS3) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) [10:25:24] (03PS16) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [10:26:27] (03CR) 10Fabfur: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:28:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet [10:30:22] (03CR) 10Elukey: "I would personally do the opposite, namely:" [puppet] - 10https://gerrit.wikimedia.org/r/933387 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [10:32:31] (03CR) 10Elukey: analytics: Remove analytics1064_1069 from hdfs net_topology (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933387 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [10:33:15] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kostajh) In Gerrit / PipelineLib workflow, the PipelineBot makes a comment in Gerrit with the newly published 
image tag names, [example](https://gerr... [10:34:56] (03PS17) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [10:35:30] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935705 [10:37:26] (03PS1) 10Jelto: gitlab: remove unused sso hiera config [puppet] - 10https://gerrit.wikimedia.org/r/935707 (https://phabricator.wikimedia.org/T320390) [10:37:28] (03PS1) 10Jelto: gitlab: use openid_connect as default sso method [puppet] - 10https://gerrit.wikimedia.org/r/935708 (https://phabricator.wikimedia.org/T320390) [10:37:32] (03PS1) 10MVernon: Hiera: delete search_backup swift user [labs/private] - 10https://gerrit.wikimedia.org/r/935709 (https://phabricator.wikimedia.org/T341081) [10:37:42] (03PS1) 10MVernon: Hiera: delete search:backup swift user [puppet] - 10https://gerrit.wikimedia.org/r/935710 (https://phabricator.wikimedia.org/T341081) [10:38:40] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935705 (owner: 10Kosta Harlan) [10:39:26] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935705 (owner: 10Kosta Harlan) [10:39:51] (03PS2) 10MVernon: Hiera: delete search:backup swift user [puppet] - 10https://gerrit.wikimedia.org/r/935710 (https://phabricator.wikimedia.org/T341081) [10:39:59] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42257/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [10:40:29] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:40:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42258/console" [puppet] - 
10https://gerrit.wikimedia.org/r/935707 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [10:41:00] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [10:42:52] (03CR) 10Jcrespo: [C: 03+1] Hiera: delete search:backup swift user [puppet] - 10https://gerrit.wikimedia.org/r/935710 (https://phabricator.wikimedia.org/T341081) (owner: 10MVernon) [10:43:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42259/console" [puppet] - 10https://gerrit.wikimedia.org/r/935708 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [10:43:44] (03CR) 10Jcrespo: [C: 03+1] Hiera: delete search_backup swift user [labs/private] - 10https://gerrit.wikimedia.org/r/935709 (https://phabricator.wikimedia.org/T341081) (owner: 10MVernon) [10:44:07] (03CR) 10MVernon: [C: 03+2] Hiera: delete search:backup swift user [puppet] - 10https://gerrit.wikimedia.org/r/935710 (https://phabricator.wikimedia.org/T341081) (owner: 10MVernon) [10:44:49] (03CR) 10MVernon: [V: 03+2 C: 03+2] Hiera: delete search_backup swift user [labs/private] - 10https://gerrit.wikimedia.org/r/935709 (https://phabricator.wikimedia.org/T341081) (owner: 10MVernon) [10:45:10] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10MoritzMuehlenhoff) >>! In T341097#8990036, @SLyngshede-WMF wrote: > @MoritzMuehlenhoff did you rebuild the irc-ratbox deb for the Bullseye hosts? For the WI... 
[10:45:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [10:47:52] (03PS8) 10Hashar: contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T277604) [10:48:11] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [10:48:42] (03PS1) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [10:49:04] (03CR) 10Hashar: "Attached to T277604 "Permissions / ownership interfere with publishing dev-images" which got filed after granting permissions to fundraisi" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T277604) (owner: 10Hashar) [10:49:30] (03CR) 10CI reject: [V: 04-1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [10:50:45] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [10:51:20] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/927975/1915/" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T277604) (owner: 10Hashar) [10:51:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [10:51:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet [10:52:46] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe [10:53:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet [10:54:04] (03CR) 10AikoChou: [C: 03+2] ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935676 
(https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [10:55:18] (03Merged) 10jenkins-bot: ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/935676 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [10:55:50] (03PS2) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [10:56:37] (03CR) 10CI reject: [V: 04-1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [10:59:29] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:48] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts urldownloader2001.wikimedia.org [11:00:55] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe [11:01:03] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:23] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy revscoring models for test.wikipedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) [11:01:26] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[11:01:27] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:21] (03CR) 10AOkoth: [C: 03+1] vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [11:03:18] (03PS1) 10Muehlenhoff: Remove old URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/935713 (https://phabricator.wikimedia.org/T329945) [11:03:48] 10SRE-swift-storage, 10serviceops, 10Patch-For-Review: Remove search:backup swift account and storage - https://phabricator.wikimedia.org/T341081 (10MatthewVernon) 05Open→03Resolved All done, including roll-restart of the proxies to make this change take effect. [11:05:39] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:45] (03PS3) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [11:05:47] (03PS1) 10JMeybohm: rake_modules/taskgen: Don't process non files in setup_python_extensions [puppet] - 10https://gerrit.wikimedia.org/r/935714 [11:06:03] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:06:39] (03CR) 10CI reject: [V: 04-1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [11:06:41] (03PS1) 10Hnowlan: requirements: bump pyssim [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 [11:09:16] !log 
jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:09:20] (03CR) 10Elukey: "On swift I see:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:10:37] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:29] (03CR) 10Elukey: [C: 03+1] "Other than the other comment the rest looks good :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:11:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) Everything looks good. mw-api-ext: {F37129502} {F37129504} {F37129506} mw-web: {F37129508} {F37129510} {F37129512} [11:12:09] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:16] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! All the referenced files seem to have already been generated by the netbox cookbook so good to go." 
[dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [11:14:44] (03PS2) 10JMeybohm: rake_modules/taskgen: Don't process direcories in setup_python_extensions [puppet] - 10https://gerrit.wikimedia.org/r/935714 [11:14:47] (03PS4) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [11:14:53] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] private.eqiad.wikimedia.cloud: introduce support for new zone [dns] - 10https://gerrit.wikimedia.org/r/935446 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [11:16:33] (03CR) 10Clément Goubert: [C: 03+2] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T277604) (owner: 10Hashar) [11:21:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [11:22:43] !of restarting archiva on archiva.wikimedia.org to pick up Java security updates [11:25:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:25:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:25:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts urldownloader2001.wikimedia.org [11:25:39] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for 
hosts: `urldownloader2001.wikimedia.org` - urldownloader2001.wikimedia.org (**PASS... [11:25:49] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts urldownloader2002.wikimedia.org [11:27:39] (03CR) 10Hnowlan: "numpy-related DeprecationWarnings have been removed from test output." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/935715 (owner: 10Hnowlan) [11:27:54] (03PS2) 10Hashar: contint: rm git safe.directory for dev-images [puppet] - 10https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354) [11:27:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [11:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet [11:28:55] (03CR) 10Clément Goubert: [C: 03+1] contint: rm git safe.directory for dev-images [puppet] - 10https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354) (owner: 10Hashar) [11:31:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:32:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [11:33:24] (03CR) 10Ilias Sarantopoulos: ml-services: deploy revscoring models for test.wikipedia.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:33:57] (03PS3) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [11:34:06] (03CR) 10Clément Goubert: [C: 03+2] contint: rm git safe.directory for dev-images [puppet] - 10https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354) (owner: 10Hashar) [11:34:16] (03CR) 10Btullis: [C: 03+2] contint: rm git safe.directory for dev-images [puppet] - 10https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354) (owner: 10Hashar) [11:34:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:36:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:36:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:36:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts urldownloader2002.wikimedia.org [11:36:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `urldownloader2002.wikimedia.org` - urldownloader2002.wikimedia.org (**PASS... [11:37:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts urldownloader1002.wikimedia.org [11:37:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/935444 (https://phabricator.wikimedia.org/T332765) (owner: 10Stevemunene) [11:38:26] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) (owner: 10Majavah) [11:39:00] (03PS16) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [11:39:56] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [11:40:10] (03CR) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [11:40:34] (03PS4) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [11:40:54] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy revscoring models for test.wikipedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:41:15] (03PS1) 10Alexandros Kosiaris: ipoid: Remove APP_CONFIG env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/935720 [11:41:41] (03Merged) 10jenkins-bot: ml-services: deploy revscoring models for test.wikipedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/935712 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:43:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:43:35] (03Abandoned) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857 (owner: 10Thiemo Kreuz (WMDE)) [11:43:39] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10MoritzMuehlenhoff) >>! In T159412#8986264, @jbond wrote: > I personally dont think its to bad to include e.g. 
role::mediaw... [11:43:57] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack cloudsw gw support [puppet] - 10https://gerrit.wikimedia.org/r/935721 (https://phabricator.wikimedia.org/T341063) [11:44:48] (03CR) 10Muehlenhoff: Add missing build dependencies for the Debian package (031 comment) [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [11:45:17] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [11:45:18] (03CR) 10Hashar: "And it worked!" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T277604) (owner: 10Hashar) [11:45:24] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:45:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet [11:45:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:46:10] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: introduce per-rack cloudsw gw support [puppet] - 10https://gerrit.wikimedia.org/r/935721 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [11:47:26] (03CR) 10Slyngshede: "This a suggestion for a work-around for the issues with Wikitech overwriting the email in LDAP with empty strings. 
The IDM will create a u" [software/bitu] - 10https://gerrit.wikimedia.org/r/935376 (owner: 10Slyngshede) [11:47:44] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [11:47:50] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:48:22] (03PS2) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack cloudsw gw support [puppet] - 10https://gerrit.wikimedia.org/r/935721 (https://phabricator.wikimedia.org/T341063) [11:49:18] (03PS1) 10Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation 10th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) [11:50:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [11:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:50:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts urldownloader1002.wikimedia.org [11:51:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `urldownloader1002.wikimedia.org` - urldownloader1002.wikimedia.org (**PASS... 
[11:51:17] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts urldownloader1001.wikimedia.org [11:52:42] (03PS2) 10Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation 10, 11th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) [11:53:21] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) [11:55:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:55:37] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [11:57:53] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! I'm not an expert on puppet syntax but the logic looks correct." [puppet] - 10https://gerrit.wikimedia.org/r/935721 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [11:58:49] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:59:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: urldownloader1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:59:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:59:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts urldownloader1001.wikimedia.org [12:00:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `urldownloader1001.wikimedia.org` - urldownloader1001.wikimedia.org (**PASS... [12:02:33] (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrich revert docker image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) (owner: 10Gmodena) [12:11:21] (03PS2) 10JMeybohm: envoy: Refactor max_requests_per_connection [puppet] - 10https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124) [12:11:23] (03PS2) 10JMeybohm: envoy: Remove tls_minimum_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) [12:11:25] (03PS3) 10JMeybohm: rake_modules/taskgen: Don't process direcories in setup_python_extensions [puppet] - 10https://gerrit.wikimedia.org/r/935714 [12:11:27] (03PS5) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [12:14:42] (03PS18) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) [12:17:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: cloud_private_subnet: introduce per-rack cloudsw gw support [puppet] - 10https://gerrit.wikimedia.org/r/935721 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [12:19:45] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42262/console" [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [12:22:30] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/935729 [12:22:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [12:23:22] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [12:24:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/935729 (owner: 10Arturo Borrero Gonzalez) [12:25:04] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30402) = 12.5% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [12:25:53] (03PS1) 10Jbond: puppetmaster::puppetdb: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/935731 (https://phabricator.wikimedia.org/T330490) [12:25:55] (03PS1) 10Jbond: puppetmaster::puppetdb: allow users to configure ssl_verify_client [puppet] - 10https://gerrit.wikimedia.org/r/935732 [12:26:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42264/console" [puppet] - 10https://gerrit.wikimedia.org/r/935732 (owner: 10Jbond) [12:26:51] (03PS2) 10Jbond: puppetmaster::puppetdb: allow users to configure ssl_verify_client [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) [12:28:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42265/console" [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:28:23] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:28:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [12:28:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935707 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:29:30] (03CR) 10Jbond: [C: 03+1] gitlab: use openid_connect as default sso method [puppet] - 10https://gerrit.wikimedia.org/r/935708 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:29:34] (03CR) 
10CI reject: [V: 04-1] puppetmaster::puppetdb: allow users to configure ssl_verify_client [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [12:30:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet [12:30:26] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-b1 codfw - aborrero@cumin2002" [12:31:10] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw-b1 codfw - aborrero@cumin2002" [12:31:10] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:31:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935714 (owner: 10JMeybohm) [12:32:27] (03PS5) 10Jelto: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) [12:34:12] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudsw-b1.private.codfw.wikimedia.cloud on all recursors [12:34:14] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudsw-b1.private.codfw.wikimedia.cloud on all recursors [12:34:47] (03PS3) 10Jbond: puppetmaster::puppetdb: allow users to configure ssl_verify_client [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) [12:34:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1019.eqiad.wmnet [12:34:55] (03CR) 10Jelto: sre: add gitlab ci alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [12:36:18] (03CR) 10Jbond: [V: 
03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42266/console" [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:38:56] (03PS1) 10Jbond: puppedb::bookworm: Force client auth [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811) [12:40:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42267/console" [puppet] - 10https://gerrit.wikimedia.org/r/935733 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:40:28] (03CR) 10Jbond: [C: 03+2] puppetmaster::puppetdb: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/935731 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:40:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::puppetdb: allow users to configure ssl_verify_client [puppet] - 10https://gerrit.wikimedia.org/r/935732 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:41:22] (03CR) 10JMeybohm: "PCC: https://puppet-compiler.wmflabs.org/output/935711/42261/" [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [12:41:39] (03CR) 10JMeybohm: "PCC: https://puppet-compiler.wmflabs.org/output/935711/42261/" [puppet] - 10https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [12:41:43] (03CR) 10JMeybohm: "PCC: https://puppet-compiler.wmflabs.org/output/935711/42261/" [puppet] - 10https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [12:42:51] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: use openid_connect as default sso method [puppet] - 10https://gerrit.wikimedia.org/r/935708 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:42:59] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove unused sso hiera config 
[puppet] - 10https://gerrit.wikimedia.org/r/935707 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:44:13] (03CR) 10Jcrespo: "I am going on vacations. As I said before, I see no blockers on the current method, I just asked for a puppet compilation to make sure it " [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:46:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [12:53:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet [12:53:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1019.eqiad.wmnet [12:54:19] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:58:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T1300). [13:00:06] aanzx and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:09] o/ [13:01:08] o/ [13:01:14] I can deploy [13:02:35] * Lucas_WMDE knows nothing about how to deploy private changes though [13:02:49] so that might be an issue for Daimona :/ [13:02:56] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:02:57] Right, and I have no idea too :O [13:03:23] (Also, I should maybe clarify that the private change is also for beta) [13:04:04] * Lucas_WMDE checks which skins are available on frwikinews [13:04:58] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) GitLab replicas (e.g. https://gitlab-replica.wikimedia.org) use oidc as the default login method now. Login (normal user login a... [13:05:20] aanzx: apparently frwikinews still has monobook available in the preferences, any idea if the allowed skins should include that? [13:06:21] though OTOH the same is apparently true of hewiki too [13:06:31] (03PS2) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) [13:06:43] or htwiki for that matter [13:06:56] so I guess using the same list as those wikis is fine 🤷 [13:07:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "run CI" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) (owner: 10Anzx) [13:07:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [13:08:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:08:43] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: introduce per-rack vlan_id support 
[puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [13:10:18] (03PS3) 10Ssingh: Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 [13:10:21] oof, CI is quite busy [13:11:58] aanzx: are you there? [13:13:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:14:32] okay I found the private repo in git at least [13:14:34] (03PS1) 10Ilias Sarantopoulos: httpbb: add testwiki model tests [puppet] - 10https://gerrit.wikimedia.org/r/935742 (https://phabricator.wikimedia.org/T319170) [13:14:39] I guess I just… edit the file there, commit it, and don’t push it anywhere? [13:14:47] and possibly do a scap sync-file, not sure [13:16:43] (03CR) 10Ssingh: "Thanks for the reviews! Decided to address the nits in this release itself." [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [13:16:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [13:17:00] 10SRE, 10Traffic-Icebox: Alert in case of significant discrepancies between the number of nginx and varnish responses - https://phabricator.wikimedia.org/T232574 (10fgiunchedi) Ticket went stale / obsolete, untagging observability [13:17:07] Daimona: will there eventually be a corresponding secret for production as well? 
[13:17:14] Yeah [13:17:19] It'll be different tho [13:17:24] ok [13:17:32] just want to know for the commit message [13:17:37] afk [13:18:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:18:53] back [13:21:06] Lucas_WMDE: I will add monobook in a moment [13:21:16] ok [13:21:22] if you think it’s needed [13:21:28] (I changed my mind and thought leaving it out seems fine too) [13:21:42] Daimona: and the setting name is not private, right? I can put that in e.g. the log message [13:21:50] Yup [13:21:53] ok [13:22:21] (03CR) 10CI reject: [V: 04-1] Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [13:23:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:23:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [13:23:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [13:24:00] (03CR) 10Elukey: [C: 03+2] httpbb: add testwiki model tests [puppet] - 10https://gerrit.wikimedia.org/r/935742 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:24:11] (03CR) 10Ssingh: "E: dnsdist: init.d-script-does-not-implement-required-option etc/init.d/dnsdist force-reload" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [13:24:36] (03PS6) 10Anzx: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) [13:24:47] (03PS7) 10Anzx: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) [13:24:52] (03PS4) 10Ssingh: Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 [13:25:20] @Lucas_WMDE: changed [13:25:46] aanzx: now it’s an empty array [13:26:00] I don’t think that’s what you intended? [13:26:22] It was done for zhwikinews [13:27:04] lol, apparently I merged that one https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/762761 [13:27:07] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T341021 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. resolving [13:27:08] * Lucas_WMDE has no memory of this place [13:28:00] ok, the empty array is the same as allowing all skins, because the extension likes to be confusing I guess [13:28:12] I probably complained about this last year too [13:28:13] Yes [13:28:22] but sure then let’s go for that [13:28:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [13:28:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "one more CI run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) (owner: 10Anzx) [13:28:51] * Lucas_WMDE has no memory of this place <-- ohhhh, a man of culture :P [13:28:59] (03PS1) 10Ilias Sarantopoulos: ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) [13:29:02] :P [13:31:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM (in this setting, empty array means allow all skins)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) (owner: 10Anzx) [13:31:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 
(https://phabricator.wikimedia.org/T341105) (owner: 10Anzx) [13:32:39] (03CR) 10Vgutierrez: [C: 03+1] haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [13:32:45] 10ops-codfw, 10Traffic: lvs2013 ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10ssingh) [13:32:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:33:44] (03CR) 10Jforrester: Add performer_pageview_id & performer_is_bot to wikifunctions.ui stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005) (owner: 10David Martin) [13:35:22] Lucas_WMDE: I don't know where private settings are stored on beta. Most probably on deployment-deploy03, maybe in /srv/mediawiki-staging/private? [13:36:01] yeah there’s a git repo there at least [13:36:13] yeah that looks valid, then edit PrivateSettings.php :] [13:36:22] and then commit and scap sync-file? [13:36:28] I assume so yeah [13:36:31] ok thanks! [13:36:38] will do that after the other config change then [13:36:55] Jenkins does a scap every 10 minutes as well [13:37:02] but you should be able to manually sync it [13:39:23] 10ops-codfw, 10Traffic: lvs2013 ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10Jhancock.wm) I found the idrac light blinking rapidly in amber. Quick Sync is not responding. I tried rebooting just the idrac but it hasn't helped. The next troubleshooting step is to reboot the server. @ssingh...
[13:41:13] !log disable puppet and stop pybal on lvs2013: T340960 [13:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:16] T340960: lvs2013 ManagementSSHDown - https://phabricator.wikimedia.org/T340960 [13:43:32] (03Merged) 10jenkins-bot: Enable Extension:RelatedArticles for desktop on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935488 (https://phabricator.wikimedia.org/T341105) (owner: 10Anzx) [13:44:14] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:935488|Enable Extension:RelatedArticles for desktop on frwikinews (T341105)]] [13:44:17] T341105: Extension:RelatedArticles on frwikinews - https://phabricator.wikimedia.org/T341105 [13:45:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:935488|Enable Extension:RelatedArticles for desktop on frwikinews (T341105)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:45:51] aanzx: can you test it on mwdebug? 
[13:45:57] Yes [13:47:03] ok [13:47:29] Lucas_WMDE: works fine [13:47:41] ok thanks [13:47:43] syncing [13:47:48] (03PS1) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [13:48:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:48:52] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:49:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:50:27] ^^ that's expected, lvs2013 is depooled [13:51:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: mgmt interface issues [13:51:20] ^^ sukhe [13:51:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs2013.codfw.wmnet with reason: mgmt interface issues [13:51:24] yeah [13:51:24] 10ops-codfw, 10Traffic: lvs2013 ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f6099155-97b3-49c3-9c11-36962a3c834b) set by vgutierrez@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: mgmt interface issu... 
[13:51:27] I didn't downtime on purpose though [13:51:29] but it's fine [13:51:41] but yes, expeted [13:51:43] sukhe: router will scream anyways [13:51:45] *expected [13:51:59] but no need to flood with errors while the server is rebooted IMHO [13:52:23] I guess, my rationale is that it's easier to catch errors here versus actively looking for them and it doesn't page [13:53:03] (03CR) 10Clément Goubert: [C: 03+1] Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [13:53:09] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:935488|Enable Extension:RelatedArticles for desktop on frwikinews (T341105)]] (duration: 08m 54s) [13:53:12] T341105: Extension:RelatedArticles on frwikinews - https://phabricator.wikimedia.org/T341105 [13:54:06] alright, let’s try that private config for Daimona [13:54:20] (03PS1) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [13:54:21] Fingers crossed [13:54:22] (03PS1) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [13:55:34] (03CR) 10Kamila Součková: [C: 03+1] api-gateway: add native AQS1-style routes for AQS services [deployment-charts] - 10https://gerrit.wikimedia.org/r/935457 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [13:55:50] !log expand kafka topic partitions from 1 to 5 for {codfw,eqiad}.mediawiki.job.RecordLintJob and {eqiad,codfw}.mediawiki.job.refreshLinks on kafka-main eqiad/codfw - T338357 [13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:53] T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357 [13:56:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] 
"Let's send an email letting all of SRE we added a new paging alert before merging this one." [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [13:56:21] * Lucas_WMDE configures git user name and email [13:57:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1021.eqiad.wmnet [13:57:02] git defaults to joe’s own editor, mercy [13:57:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [13:58:03] ok, syncing in beta [13:58:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:59:01] Lucas_WMDE: in my env I have `EDITOR=/usr/bin/vim` :) [13:59:11] I think that is honored by git [13:59:30] sukhe: ^^ that RECOVERY doesn't look very good [13:59:45] yeah, for some reason, the cumin command didn't go through [13:59:49] hashar: I used GIT_EDITOR=vim git -C private/ commit PrivateSettings.php ^^ [13:59:50] manually disabled puppet now [13:59:58] Lucas_WMDE: +1 :) [14:00:06] sukhe: stopped pybal again? [14:00:09] yes [14:00:11] ack [14:00:15] weird right? [14:00:21] defo [14:00:22] I got a PASS on cumin and puppet was disabled [14:00:26] and then enabled again? [14:00:50] if you need a hand just shout [14:00:53] is it ok for me to continue deploying? (beta-only config change but it would at least git pull in production) [14:01:03] volans: nah.. but it seems that puppet got re-enabled on server reboot [14:01:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [14:01:23] that seems weird [14:01:32] via a cookbook or manual steps? 
[14:01:51] in a meeting [14:01:52] but will explain [14:02:00] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935141 [14:02:28] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935142 [14:02:41] I’ll go ahead with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/935463/ (CS-labs.php only) unless someone tells me not to [14:02:55] and assume that the pybal thing isn’t a blocker for it [14:03:13] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935095 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:03:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935463 (https://phabricator.wikimedia.org/T320258) (owner: 10Daimona Eaytoy) [14:05:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] mesh.configuration: Refactor max_requests_per_connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/935680 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [14:05:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add mesh.configuration 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935679 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:05:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] mesh.configuration: Remove tls_minimum_protocol_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935684 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [14:05:54] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "LGTM, I can take care of deploying this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935713 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [14:06:59] (03PS1) 10Fabfur: Revert "haproxy: support different actions for tls and http frontend" [puppet] - 
10https://gerrit.wikimedia.org/r/935495 [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [14:08:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [14:08:22] (03Merged) 10jenkins-bot: beta: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935463 (https://phabricator.wikimedia.org/T320258) (owner: 10Daimona Eaytoy) [14:08:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Biggest comment is inline." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:08:37] (03PS1) 10Muehlenhoff: Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [14:08:55] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrich revert docker image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) (owner: 10Gmodena) [14:08:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy: Refactor max_requests_per_connection [puppet] - 10https://gerrit.wikimedia.org/r/935678 (https://phabricator.wikimedia.org/T304124) (owner: 10JMeybohm) [14:09:02] (03CR) 10CI reject: [V: 04-1] Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:09:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet [14:09:31] (03CR) 10Fabfur: [C: 03+2] Revert "haproxy: support different actions for tls and http frontend" [puppet] - 10https://gerrit.wikimedia.org/r/935495 (owner: 10Fabfur) [14:09:54] (03PS7) 10Jbond: puppetdb: Add ability to configure secondary proxies [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) [14:09:59] Daimona: should be deployed within 10-20 minutes I believe [14:10:03] (03Merged) 10jenkins-bot: Remove old URL downloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/935713 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [14:10:10] (the scap backport command already finished, it’s not getting synced to prod) [14:10:22] Amazing, thank you :) [14:10:35] probably in 5 minutes from now https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ [14:10:43] leaving the terminal open in case it doesn’t work or needs to be reverted [14:11:43] (03Merged) 10jenkins-bot: mw-page-content-change-enrich revert docker image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/935700 (https://phabricator.wikimedia.org/T341096) (owner: 10Gmodena) [14:11:50] !log disabling puppet in all cp- hosts for error in configuration [14:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] (03CR) 10Jgreen: [C: 03+2] Remove hosts to be decommissioned. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/933198 (https://phabricator.wikimedia.org/T340155) (owner: 10Dwisehaupt) [14:13:01] (03CR) 10JMeybohm: [C: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:13:22] (03CR) 10JMeybohm: [C: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:14:30] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:15:59] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:16:15] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [14:16:19] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:16:48] !log depool cdn service in cp2027.codfw.wmnet,cp1075.eqiad.wmnet,cp3050.esams.wmnet [14:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:01] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:17:02] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:12] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:18:14] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [14:18:21] (03PS1) 10Samuel (WMF): replicas: redact revdeleted, oversighted information [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) [14:18:28] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:18:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:18:41] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [14:18:59] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10akosiaris) >>! In T340955#8989979, @JMeybohm wrote: > `max(sum by (instance) (envo... [14:19:02] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:19:33] (03CR) 10Alexandros Kosiaris: mesh.configuration: Limit the total number of active connections (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [14:19:48] (03PS1) 10Jelto: gitlab: update gitaly prometheus exporter config for gitlab 16 [puppet] - 10https://gerrit.wikimedia.org/r/935753 (https://phabricator.wikimedia.org/T338460) [14:22:11] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:22:17] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:23:22] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42268/console" [puppet] - 10https://gerrit.wikimedia.org/r/935753 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [14:24:38] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:24:42] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:26:06] alright, beta should be synced now I think [14:27:25] Lucas_WMDE: Thanks, it works perfectly! [14:27:46] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:27:47] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:28:03] (03CR) 10Jelto: [V: 03+1] "This change makes the gitaly config compatible with the new major version." [puppet] - 10https://gerrit.wikimedia.org/r/935753 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [14:28:54] Daimona: great, thanks for checking! 
[14:29:00] !log UTC afternoon backport+config window done [14:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] I have no idea if the private change was synced correctly, but that is not for me to decide. All I have to decide is whether it works correctly from a user's perspective. [14:29:42] I guess it must’ve been synced in a way that works, at least [14:30:11] Yup, and that's good enough (TM) [14:30:58] 10SRE-swift-storage, 10MediaWiki-extensions-Phonos, 10Wikimedia-production-error: Steady rate of Phonos Swift errors (inc. DescribeFileOp failed, FileBackendStore::ingestFreshFileStats: Could not stat) - https://phabricator.wikimedia.org/T329249 (10KSiebert) [14:31:04] (03PS3) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) [14:31:43] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jhathaway) >>! In T340557#8985560, @jbond wrote: > And use this for talking to the pki infrastructure. This is a... [14:33:06] (03CR) 10CI reject: [V: 04-1] wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [14:33:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [14:33:51] jbond: merged your change on private.git btw [14:34:32] the public private.git, you get the idea [14:35:00] (03PS4) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) [14:35:33] (03CR) 10Btullis: "Adding milimetric for a review on the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [14:35:38] 10ops-codfw, 10Traffic: lvs2013 ManagementSSHDown - https://phabricator.wikimedia.org/T340960 (10ssingh) 05Open→03Resolved a:03ssingh Thanks to @Jhancock.wm for the quick resolution of this issue! [14:35:39] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:45] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10MoritzMuehlenhoff) 05Open→03Resolved We're running new URL downloaders on Bullseye (urldownloader[12]00[34].wikimedia.org) an the one Buster systems have been decommissioned. [14:37:24] 10SRE-swift-storage, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10KSiebert) [14:37:28] (03PS1) 10JMeybohm: mesh.configuration: Update all charts t 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) [14:37:29] 10SRE-swift-storage, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10KSiebert) [14:38:46] !log pool cdn service in cp2027.codfw.wmnet,cp1075.eqiad.wmnet,cp3050.esams.wmnet [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] (03PS3) 10Sergio Gimeno: GrowthExperiments: Enable backend of link recommendation 10, 11, 12th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935723 (https://phabricator.wikimedia.org/T308135) [14:40:03] (03PS1) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 [14:40:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti1022.eqiad.wmnet [14:40:20] (03PS6) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [14:40:22] (03PS5) 10JMeybohm: mesh.configuration: Limit the total number of active connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) [14:40:24] (03PS2) 10JMeybohm: mesh.configuration: Update all charts t 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/935754 (https://phabricator.wikimedia.org/T300324) [14:40:36] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) >>! In T340557#8990830, @jhathaway wrote: >>>! In T340557#8985560, @jbond wrote: >> And use this for talki... [14:40:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [14:40:53] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [14:41:50] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) I've reduced the limit to 50k (which is what https://www.envoyproxy.io/d... 
[14:42:46] (03PS2) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) [14:42:57] !log re-enable puppet and start pybal on lvs2013 [14:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:25] (03CR) 10CI reject: [V: 04-1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [14:43:27] (03CR) 10JMeybohm: mesh.configuration: Limit the total number of active connections (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935702 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [14:43:29] (03PS3) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) [14:44:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for lvs2013.codfw.wmnet [14:44:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2013.codfw.wmnet [14:45:09] (03PS7) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [14:47:42] (03PS4) 10Jbond: puppetdb: add secondary web site to proxy requests form the puppet5 masters [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) [14:47:59] (03CR) 10CI reject: [V: 04-1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [14:48:13] !log btullis@cumin1001 END 
(PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [14:48:14] (03PS1) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) [14:48:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42271/console" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:49:08] (03CR) 10Vgutierrez: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:49:29] (03PS8) 10JMeybohm: envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) [14:50:51] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:51:02] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [14:51:10] (03CR) 10Ssingh: [C: 03+2] P:systemd::timesyncd: automate generation of ntp_servers list [puppet] - 10https://gerrit.wikimedia.org/r/933473 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [14:52:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [14:52:23] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:52:30] ^ expected [14:52:31] (03CR) 10Fabfur: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935760 
(https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:54:46] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42272/console" [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [14:59:54] (03CR) 10Vgutierrez: haproxy: support different actions for tls and http frontend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [15:00:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [15:00:59] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [15:02:08] (03PS2) 10Muehlenhoff: Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [15:02:31] (03CR) 10CI reject: [V: 04-1] Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:04:59] (03CR) 10Effie Mouzeli: [C: 03+1] "The latest swift unavailability was detected by accident, and quite a while after it had started" [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [15:05:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host an-test-worker1003.eqiad.wmnet [15:05:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [15:06:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [15:06:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [15:07:13] (03Merged) 10jenkins-bot: 
Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 (owner: 10Giuseppe Lavagetto) [15:08:22] (03PS2) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) [15:09:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [15:09:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1024.eqiad.wmnet [15:09:32] (03CR) 10Ssingh: [C: 03+2] Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [15:10:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [15:11:52] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Arnoldokoth) I ran into a 422 error trying to login to gitlab-replica. 
{F37129701} [15:12:14] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:13:10] (03PS4) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) [15:13:29] (03PS3) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) [15:13:51] (03CR) 10Fabfur: haproxy: support different actions for tls and http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [15:21:22] (03Abandoned) 10Giuseppe Lavagetto: envoyproxy: add spdx license headers [puppet] - 10https://gerrit.wikimedia.org/r/792969 (owner: 10Giuseppe Lavagetto) [15:23:09] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:23:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [15:24:47] (03CR) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [15:25:48] (03PS2) 10Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) [15:25:50] (03PS4) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [15:25:52] (03PS4) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [15:25:54] (03PS2) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [15:25:56] (03PS2) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [15:25:58] (03PS4) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [15:26:00] (03PS8) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [15:26:47] !log reprepro -C component/dnsdist include bullseye-wikimedia dnsdist_1.8.0-1+wmf11u1_amd64.changes [15:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:59] (03PS3) 10Muehlenhoff: Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [15:28:26] (03CR) 10CI reject: [V: 04-1] Add a new 
nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:29:02] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-journalnode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:10] (03PS4) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) [15:29:12] (03PS4) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) [15:29:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [15:29:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [15:31:01] (03PS2) 10David Martin: Add performer_pageview_id & performer_is_bot to wikifunctions.ui stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005) [15:33:22] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) At first sorry for the large delay, the last 2/3 weeks have been pretty much jumping from an interrupt to another one.... [15:33:42] (03CR) 10Jforrester: [C: 03+2] "Beta-only patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005) (owner: 10David Martin) [15:33:44] (03PS5) 10Fabfur: haproxy: support different actions for tls and http frontend [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) [15:35:42] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 4 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42282/console" [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [15:36:09] (03PS4) 10Muehlenhoff: Add a new nftables::service define (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [15:36:15] (03CR) 10David Martin: Add performer_pageview_id & performer_is_bot to wikifunctions.ui stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005) (owner: 10David Martin) [15:36:28] (03PS3) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [15:36:31] (03PS3) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [15:36:32] (03PS5) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [15:36:34] (03PS9) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [15:36:42] (03CR) 10Vgutierrez: [C: 03+1] "nice fix, you might want to merge this one tomorrow morning just in case ;)" [puppet] - 10https://gerrit.wikimedia.org/r/935760 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [15:38:06] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/933497/42280/" [puppet] - 
10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [15:38:13] !log mlitn@deploy1002 Started deploy [airflow-dags/platform_eng@a97da10]: (no justification provided) [15:38:14] Somehow, it looks like zuul/jenkins/integration/CI is broken. At least in the test pipeline, jobs are only queued/pending and none are actually processed? But I might be looking at it wrong? [15:38:27] (03PS5) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [15:38:38] !log mlitn@deploy1002 Finished deploy [airflow-dags/platform_eng@a97da10]: (no justification provided) (duration: 00m 25s) [15:38:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) This should be discussed with Joe, since he is the original task creator, which said everything should follow the r... [15:40:55] MichaelG_WMDE: I do see some stuff running [15:40:59] It's backlogged though [15:41:00] A lot [15:41:51] hashar: ^ [15:41:54] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42284/console" [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [15:42:34] Gotcha, then I'll back off a bit with end-of-day changes. Thanks @RhinosF1 [15:42:41] MichaelG_WMDE: it looks like it's been raised in -releng that not all of the new runners are behaving properly. [15:43:13] 10SRE, 10Domains: Mark Monitor administration panel (redirects for wikimedia.pl) - https://phabricator.wikimedia.org/T333827 (10Dzahn) Pretty sure that SRE is needed to add this domain to DNS and create redirects. Whether control of a single domain in MarkMonitor can be handed over to another tenant, I am dou... 
[15:44:18] MichaelG_WMDE, RhinosF1Should now be fixed. [15:44:32] Bleh, auto-complete fixery. [15:44:42] (03Merged) 10jenkins-bot: Add performer_pageview_id & performer_is_bot to wikifunctions.ui stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935536 (https://phabricator.wikimedia.org/T338005) (owner: 10David Martin) [15:45:05] @James_F Thank you! [15:45:09] James_F: thanks :) [15:45:18] (03PS5) 10Arturo Borrero Gonzalez: wmcs: cloud_private_subnet: introduce per-rack vlan_id support [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) [15:45:18] HTH. [15:45:54] As far as I can see, oldest patch is 3 hours ago [15:55:14] !log re-enabled puppet in all cp- hosts (done @2023-07-05 14:22:57 UTC) [15:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:35] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected https://puppet-compiler.wmflabs.org/output/935725/42287/" [puppet] - 10https://gerrit.wikimedia.org/r/935725 (https://phabricator.wikimedia.org/T341063) (owner: 10Arturo Borrero Gonzalez) [15:57:35] (03PS5) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) [15:58:34] (03PS1) 10Kamila Součková: [WIP] add all the things [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [15:59:28] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10Jclark-ctr) @BTullis would you be able to shutdown server for tomorrow morning 8:30am est [16:00:09] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (10Jclark-ctr) a:03Jclark-ctr [16:00:27] (03PS1) 10Elukey: changeprop: increase the linger.ms value [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) [16:00:29] 10SRE, 10ops-eqiad, 
10Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (10Jclark-ctr) a:03Jclark-ctr [16:02:05] (03PS2) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [16:06:51] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [16:15:48] (03Abandoned) 10Jaime Nuche: releases-jenkins: block LDAP users page [puppet] - 10https://gerrit.wikimedia.org/r/935453 (https://phabricator.wikimedia.org/T341074) (owner: 10Jaime Nuche) [16:15:58] (03CR) 10Hnowlan: [C: 03+1] "Seems like a reasonable start - 20 seems like a bit increase but given that we're talking about ms I doubt it's much of an issue. Needs a " [deployment-charts] - 10https://gerrit.wikimedia.org/r/935772 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [16:16:46] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:19:16] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:19:59] (03CR) 10Arturo Borrero Gonzalez: Add a new nftables::service define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:25:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet [16:25:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host an-test-worker1003.eqiad.wmnet [16:33:27] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [16:51:25] 
(03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [16:59:35] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935779 (https://phabricator.wikimedia.org/T341129) [16:59:46] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935779 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T1700) [17:00:40] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/935779 (https://phabricator.wikimedia.org/T341129) (owner: 10Kosta Harlan) [17:02:54] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [17:03:25] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [17:04:58] (03PS1) 10Kosta Harlan: ipoid: Remove APP_CONFIG override [deployment-charts] - 10https://gerrit.wikimedia.org/r/935781 [17:05:08] (03PS3) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [17:05:13] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Remove APP_CONFIG override [deployment-charts] - 10https://gerrit.wikimedia.org/r/935781 (owner: 10Kosta Harlan) [17:06:07] (03Merged) 10jenkins-bot: ipoid: Remove APP_CONFIG override [deployment-charts] - 10https://gerrit.wikimedia.org/r/935781 (owner: 10Kosta Harlan) [17:07:38] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [17:07:52] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [17:09:36] (03PS1) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/935782
(https://phabricator.wikimedia.org/T340479) [17:11:11] (03Abandoned) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/935782 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [17:15:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:18:17] (03CR) 10Dzahn: [C: 03+2] releases-jenkins: fix access control [puppet] - 10https://gerrit.wikimedia.org/r/935417 (https://phabricator.wikimedia.org/T338071) (owner: 10Jaime Nuche) [17:20:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:21:42] (03PS6) 10Muehlenhoff: Add a new nftables::service define [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) [17:24:38] (03PS4) 10Kamila Součková: [WIP] add Benthos cache invalidator [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 [17:25:10] (03CR) 10Muehlenhoff: Add a new nftables::service define (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935751 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [17:27:56] (03CR) 10Dzahn: [C: 03+2] "thanks a lot for this fix. 
I could confirm it works, if I changed my UA to "TweetmemeBot" for example, I get Forbidden, and it goes away w" [puppet] - 10https://gerrit.wikimedia.org/r/935417 (https://phabricator.wikimedia.org/T338071) (owner: 10Jaime Nuche) [17:32:07] (03CR) 10Andrea Denisse: [C: 03+2] webperf: Provision XHGui directly on performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/935512 (owner: 10Krinkle) [17:40:07] 10SRE-swift-storage, 10MediaWiki-Maintenance-system, 10Privacy: commonswiki.uploadstash table has unexpectedly old data - https://phabricator.wikimedia.org/T130478 (10Dzahn) Thanks for closing the other ticket as duplicate. I was merely reporter of an issue though. Since I have nothing to contribute to actua... [17:40:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [17:42:30] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:47:06] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:48:57] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) I discussed this with @Volans today and we agreed it would be nice to keep Spicera... 
[17:51:36] (03PS6) 10Ssingh: P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) [17:54:33] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42292/console" [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [17:55:30] (03CR) 10Ssingh: [V: 03+1] "I finally found the issue! "Resources only in the old catalog" stems from I7577b1b17674c657c9327db492c899cd2fb6a43f, where the profile isn" [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [17:59:30] (03PS1) 10Krinkle: webperf: Add missing PHP memory_limit setting [puppet] - 10https://gerrit.wikimedia.org/r/935787 [17:59:44] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935787 (owner: 10Krinkle) [18:00:05] hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T1800). [18:00:05] hashar and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T1800). [18:03:14] 10SRE-swift-storage, 10Commons, 10MediaWiki-Action-API, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Yann) [18:04:08] 10SRE-swift-storage, 10Commons, 10MediaWiki-Action-API, 10Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (10Yann) OK, redacted. 
[18:06:20] (03CR) 10Andrea Denisse: [C: 03+2] webperf: Add missing PHP memory_limit setting [puppet] - 10https://gerrit.wikimedia.org/r/935787 (owner: 10Krinkle) [18:07:11] (03PS1) 10Jdlrobson: Disable the Nearby feature on some sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935790 (https://phabricator.wikimedia.org/T341133) [18:07:13] (03PS1) 10Jdlrobson: Add language button at the top of the Main page of Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935791 (https://phabricator.wikimedia.org/T337666) [18:17:51] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:57] !log disable puppet on A:dns-rec to merge CR 933497 [18:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::recursor: automatically generate resolv.conf for DNS hosts [puppet] - 10https://gerrit.wikimedia.org/r/933497 (https://phabricator.wikimedia.org/T340479) (owner: 10Ssingh) [18:25:33] !log disable puppet on webperf1003 [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:53] !log disable puppet on webperf1003 to test PHP memory changes for XHGui [18:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:43] (03PS1) 10Andrea Denisse: Revert "webperf: Add missing PHP memory_limit setting" [puppet] - 10https://gerrit.wikimedia.org/r/935499 [18:36:24] !log re-enable puppet in A:dns-rec to finish merging CR 933497 and run-agent: T340479 [18:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:28] T340479: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - 
https://phabricator.wikimedia.org/T340479 [18:36:52] (03CR) 10CI reject: [V: 04-1] Revert "webperf: Add missing PHP memory_limit setting" [puppet] - 10https://gerrit.wikimedia.org/r/935499 (owner: 10Andrea Denisse) [18:42:45] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) 05Open→03Resolved a:03ssingh With the two commits above, this data is automatically generated instead of the ma... [18:46:02] (03PS2) 10Andrea Denisse: Revert "webperf: Add missing PHP memory_limit setting" [puppet] - 10https://gerrit.wikimedia.org/r/935499 [18:48:10] (03CR) 10CI reject: [V: 04-1] Revert "webperf: Add missing PHP memory_limit setting" [puppet] - 10https://gerrit.wikimedia.org/r/935499 (owner: 10Andrea Denisse) [18:48:59] (03PS3) 10Andrea Denisse: Revert "webperf: Add missing PHP memory_limit setting" [puppet] - 10https://gerrit.wikimedia.org/r/935499 [18:50:52] (03PS1) 10Andrea Denisse: Revert "webperf: Provision XHGui directly on performance.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/935500 [18:51:05] (03CR) 10CI reject: [V: 04-1] Revert "webperf: Provision XHGui directly on performance.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/935500 (owner: 10Andrea Denisse) [18:57:48] (03PS2) 10Andrea Denisse: Revert "webperf: Provision XHGui directly on performance.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/935500 [19:03:16] (03CR) 10Dzahn: "the way the puppet code is currently written means that there are things that do not get removed by a revert. 
This is the Apache config sn" [puppet] - 10https://gerrit.wikimedia.org/r/935500 (owner: 10Andrea Denisse) [19:03:21] (03CR) 10Krinkle: [C: 03+1] Revert "webperf: Provision XHGui directly on performance.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/935500 (owner: 10Andrea Denisse) [19:04:10] (03CR) 10Andrea Denisse: [C: 03+2] Revert "webperf: Provision XHGui directly on performance.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/935500 (owner: 10Andrea Denisse) [19:14:34] (03CR) 10Andrea Denisse: [C: 03+2] Revert "webperf: Provision XHGui directly on performance.wikimedia.org" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935500 (owner: 10Andrea Denisse) [19:23:00] (03CR) 10BCornwall: [V: 03+2] "Looking good! Tested on lvs4010 and it performs as expected. Some issues inline." [puppet] - 10https://gerrit.wikimedia.org/r/933398 (https://phabricator.wikimedia.org/T322377) (owner: 10Jbond) [19:40:35] (03PS2) 10Jdlrobson: Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) [19:41:07] (03PS3) 10Jdlrobson: Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) [19:51:11] 10SRE, 10Acme-chief, 10Traffic-Icebox: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BCornwall) 05Stalled→03Resolved For lack of a response, I'm going to close this. @Vgutierrez please do re-open if this isn't... [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230705T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] o/ [20:00:55] i can deploy today [20:01:27] (03CR) 10Urbanecm: [C: 03+2] Disable the Nearby feature on some sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935790 (https://phabricator.wikimedia.org/T341133) (owner: 10Jdlrobson) [20:02:18] (03Merged) 10jenkins-bot: Disable the Nearby feature on some sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935790 (https://phabricator.wikimedia.org/T341133) (owner: 10Jdlrobson) [20:03:06] thanks urbanecm [20:03:15] (03CR) 10Urbanecm: [C: 04-1] "logos.php should be edited via logos/config.yaml. This seems to put things out of sync?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:04:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935790|Disable the Nearby feature on some sister projects (T341133)]] [20:04:13] T341133: Disable the Nearby feature on some sister projects - https://phabricator.wikimedia.org/T341133 [20:05:52] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:935790|Disable the Nearby feature on some sister projects (T341133)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:06:38] Jdlrobson: your first patch is at mwdebug1001 -- can you test it there? 
[20:06:45] (also, see my CR for one of the other patches) [20:07:17] (03CR) 10Urbanecm: [C: 03+2] Add language button at the top of the Main page of Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935791 (https://phabricator.wikimedia.org/T337666) (owner: 10Jdlrobson) [20:07:59] (03Merged) 10jenkins-bot: Add language button at the top of the Main page of Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935791 (https://phabricator.wikimedia.org/T337666) (owner: 10Jdlrobson) [20:08:24] urbanecm: sure. can also fix the language one up [20:09:34] okay nearby one is good to sync [20:11:35] okay, proceeding [20:12:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jhathaway) [20:12:32] (03CR) 10JHathaway: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935755 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [20:14:50] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:12] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:17:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935790|Disable the Nearby feature on some sister projects (T341133)]] (duration: 13m 12s) [20:17:26] T341133: Disable the Nearby feature on some sister projects - https://phabricator.wikimedia.org/T341133 [20:19:10] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935791|Add language button at the top of the Main page of Italian Wikivoyage (T337666)]] [20:19:13] T337666: Add language button at the top of the Main page of Italian 
Wikivoyage - https://phabricator.wikimedia.org/T337666 [20:20:46] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:935791|Add language button at the top of the Main page of Italian Wikivoyage (T337666)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:21:09] Jdlrobson: second patch is on mwdebug. can you test? [20:25:21] looking [20:25:37] urbanecm: yeh that's good to sync! [20:26:12] syncing [20:27:08] (03CR) 10JHathaway: [C: 03+1] "looks good overall" [puppet] - 10https://gerrit.wikimedia.org/r/933608 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [20:27:47] (03CR) 10Dzahn: ci/zuul: switch gearman server from contint2001 to contint2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867705 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [20:29:10] Jdlrobson: what about the last patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/933691)? Is it possible for you to improve it, or should we reschedule it? [20:29:27] urbanecm: i'm almost done .. 
it's just very manual and tedious [20:29:37] our approach to logos really needs revisiting [20:29:40] true [20:29:43] i'll wait :) [20:31:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935791|Add language button at the top of the Main page of Italian Wikivoyage (T337666)]] (duration: 12m 38s) [20:31:52] T337666: Add language button at the top of the Main page of Italian Wikivoyage - https://phabricator.wikimedia.org/T337666 [20:31:55] (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:32:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:35:10] (03PS4) 10Jdlrobson: Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) [20:35:12] urbanecm: okay pushed [20:35:15] (03CR) 10Dzahn: miscweb: add statictendril release to miscweb staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:35:18] that took about twice as long as the manual edit [20:35:29] (03PS5) 10Urbanecm: Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:36:33] sorry for making you do it. 
i didn't want it to be lost when someone regenerates the file though [20:36:38] (03CR) 10Urbanecm: [C: 03+2] Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:36:46] no thanks for pointing that out. It's been a while [20:37:21] (03Merged) 10jenkins-bot: Update various logos where SVGs are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933691 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:37:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:43:59] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:933691|Update various logos where SVGs are available (T338162)]] [20:44:02] T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162 [20:45:38] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:933691|Update various logos where SVGs are available (T338162)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:45:54] Jdlrobson: it's on mwdebug now, please test too :) [20:46:46] urbanecm: looking [20:48:05] (03CR) 10Milimetric: [C: 04-1] replicas: redact revdeleted, oversighted information (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [20:49:01] urbanecm: okay this all looks good but I've noticed one thing that didn't work - I can either fix that now in a follow up or do it in tomorrow's backport. Not a blocker for syncing this one [20:49:14] i can sync one more patch, no problem. syncing now. 
[20:51:24] (03PS1) 10Jdlrobson: Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935809 (https://phabricator.wikimedia.org/T338162) [20:51:28] urbanecm: great ^ there's the next one [20:51:49] (03PS2) 10Urbanecm: Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935809 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:51:52] (03CR) 10Urbanecm: [C: 03+2] Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935809 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:52:34] (03Merged) 10jenkins-bot: Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935809 (https://phabricator.wikimedia.org/T338162) (owner: 10Jdlrobson) [20:55:10] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:933691|Update various logos where SVGs are available (T338162)]] (duration: 11m 10s) [20:55:13] T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162 [20:55:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:935809|Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct (T338162)]] [20:57:11] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:935809|Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct (T338162)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:57:26] Jdlrobson: it can be tested now. can you have a look? [20:57:33] urbanecm: looking! [20:57:43] urbanecm: perfect! [20:58:11] great! syncing [21:02:10] urbanecm: thanks for your help today! 
[21:02:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:02:25] no problem. it's not yet fully in prod though [21:04:01] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:935809|Optimize SVG wordmarks, enable Wikimania wordmark, fix techconduct (T338162)]] (duration: 08m 22s) [21:04:04] T338162: Track which Vector 2022 logos are in production vs Google Drive - https://phabricator.wikimedia.org/T338162 [21:04:29] Jdlrobson: and deployed now. [21:04:31] anything else? [21:05:05] urbanecm: nope not for today... possibly tomorrow :) [21:05:20] sounds good :) [21:05:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:09:57] (03CR) 10Milimetric: [C: 04-1] replicas: redact revdeleted, oversighted information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/935752 (https://phabricator.wikimedia.org/T339037) (owner: 10Samuel (WMF)) [21:20:03] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Peachey88) Its a much wider scope of works above compared to the simple quick fix, but have we ever looked at alternative ircds lately? From vague memories a... [21:23:40] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:25:56] 10SRE, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (10Dzahn) >>! In T341097#8991983, @Peachey88 wrote: > Its a much wider scope of works above compared to the simple quick fix, but have we ever looked at alterna... 
[21:51:38] (PS1) Krinkle: webperf: Provision XHGui directly on performance.wikimedia.org (take 2) [puppet] - https://gerrit.wikimedia.org/r/935501
[22:07:33] (CR) Andrea Denisse: [C: +2] webperf: Provision XHGui directly on performance.wikimedia.org (take 2) [puppet] - https://gerrit.wikimedia.org/r/935501 (owner: Krinkle)
[22:11:14] (PS1) Urbanecm: Enable global abuse filters on almost all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159)
[22:12:26] (CR) CI reject: [V: -1] Enable global abuse filters on almost all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: Urbanecm)
[22:15:19] (PS2) Urbanecm: Enable global abuse filters on almost all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159)
[22:15:56] (CR) CI reject: [V: -1] Enable global abuse filters on almost all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: Urbanecm)
[22:16:02] what did i wrong now...
[22:17:15] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts xhgui1002
[22:17:39] (PS3) Urbanecm: Enable global abuse filters on almost all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159)
[22:22:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:23:23] !log registry1003 - sudo systemctl start build-hompage
[22:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:02] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:27:22] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui1002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[22:28:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui1002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[22:28:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:28:19] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts xhgui1002
[22:28:44] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts xhgui2002
[22:33:02] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:33:48] (PS1) Andrea Denisse: xhgui: Decommission xhgui1002 and xhgui2002 hosts to deploy xhgui in webperf1003 [puppet] - https://gerrit.wikimedia.org/r/935816 (https://phabricator.wikimedia.org/T341160)
[22:35:02] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui2002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[22:35:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-test-worker1003.eqiad.wmnet
[22:36:12] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: xhgui2002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[22:36:12] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:36:13] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts xhgui2002
[22:36:43] (Abandoned) Andrea Denisse: Revert "webperf: Add missing PHP memory_limit setting" [puppet] - https://gerrit.wikimedia.org/r/935499 (owner: Andrea Denisse)
[22:37:39] (CR) Andrea Denisse: "Hello!! The Bookworm XHGui hosts are no longer required as the webperf hosts will host XHGui from now on." [puppet] - https://gerrit.wikimedia.org/r/935816 (https://phabricator.wikimedia.org/T341160) (owner: Andrea Denisse)
[22:38:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host an-test-worker1003.eqiad.wmnet
[22:50:54] (CR) Dzahn: [C: +1] xhgui: Decommission xhgui1002 and xhgui2002 hosts to deploy xhgui in webperf1003 [puppet] - https://gerrit.wikimedia.org/r/935816 (https://phabricator.wikimedia.org/T341160) (owner: Andrea Denisse)
[22:51:50] (CR) Andrea Denisse: [C: +2] xhgui: Decommission xhgui1002 and xhgui2002 hosts to deploy xhgui in webperf1003 [puppet] - https://gerrit.wikimedia.org/r/935816 (https://phabricator.wikimedia.org/T341160) (owner: Andrea Denisse)
[22:52:14] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[22:53:09] (CR) Majavah: Enable global abuse filters on almost all projects (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/935815 (https://phabricator.wikimedia.org/T341159) (owner: Urbanecm)
[22:57:33] ops-codfw, decommission-hardware, Patch-For-Review: decommission xhgui2002 - https://phabricator.wikimedia.org/T341161 (andrea.denisse) a:andrea.denisse→None
[22:58:52] SRE, Infrastructure-Foundations, Wikimedia-IRC-RC-Server: Spam in PMs on IRC recent changes server - https://phabricator.wikimedia.org/T341097 (stwalkerster) I've been having... fun... getting ratbox to compile on my Ubuntu focal desktop for testing. I've had to do some changes to `includes/memory.h`...
[23:00:42] ops-eqiad, decommission-hardware: decommission xhgui1002 - https://phabricator.wikimedia.org/T341160 (andrea.denisse)
[23:01:17] ops-eqiad, decommission-hardware: decommission xhgui1002 - https://phabricator.wikimedia.org/T341160 (andrea.denisse) a:andrea.denisse→None
[23:55:21] (CR) RLazarus: [C: +2] opentelemetry-collector: Switch off unused default receivers and ports [deployment-charts] - https://gerrit.wikimedia.org/r/934420 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[23:56:08] (Merged) jenkins-bot: opentelemetry-collector: Switch off unused default receivers and ports [deployment-charts] - https://gerrit.wikimedia.org/r/934420 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)