[00:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:37:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014050 [00:37:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014050 (owner: 10TrainBranchBot) [00:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:54:30] (03PS20) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:55:39] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:55:59] (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:57:00] (03PS21) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:57:53] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:58:01] (03PS22) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:58:40] (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:58:44] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [00:59:49] (03PS23) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [01:00:37] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [01:01:40] (03CR) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [01:02:34] (03PS24) 10Krinkle: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [01:03:47] 06SRE, 10Observability-Logging: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940#9660182 
(10andrea.denisse) We received a similar alert today: `(SystemdUnitFailed) firing: logrotate.service on logstash2003:9100` Systemd service status: `error opening /var/log/... [01:05:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014050 (owner: 10TrainBranchBot) [01:05:04] !log Starting logrotate.service on logstash2003 [01:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:32] !log Starting logrotate.service on logstash2003 - T153940 [01:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:40] T153940: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940 [01:08:15] 06SRE, 10Observability-Logging: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940#9660184 (10andrea.denisse) The `production-elk7-codfw.log` file was present in the system. After verifying the contents of the file looked correct I manually started the service an... [01:08:59] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T360972 (10phaultfinder) 03NEW [01:10:47] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Communications" "Wikimedia Foundation/Communications" "Zabe" --reason "per request [[:phab:T360970|T360970]]" # T360970 [01:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:51] T360970: Request to move translatable page on Meta-Wiki: Communications - https://phabricator.wikimedia.org/T360970 [01:45:54] (03CR) 10Ssingh: [C:03+1] "Hi: It's been a while since this CR and then the Summit happened. 
From Traffic's end, I wanted to share that we have finalized the transit" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T0200) [02:01:20] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660226 (10tstarling) PhpRedis is getting behind KeyDB with [[https://github.com/phpredis/phpredis/issues/2466|#2466]] an... [02:07:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.24 [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014051 (https://phabricator.wikimedia.org/T360156) [02:07:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.42.0-wmf.24 [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014051 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [02:27:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.24 [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014051 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [02:37:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T0300) [03:02:46] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:03:39] !log mwpresync@deploy1002 Pruned MediaWiki: 1.42.0-wmf.21 (duration: 03m 33s) [03:05:07] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014150 (https://phabricator.wikimedia.org/T360156) [03:05:09] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014150 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [03:05:51] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014150 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [03:06:21] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.42.0-wmf.24 refs T360156 [03:06:25] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [03:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:17:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:56:52] (03Abandoned) 10Abijeet Patro: Localisation updates from 
https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1013286 (owner: 10L10n-bot) [05:24:49] !log dancy@deploy1002 Started scap: testwikis wikis to 1.42.0-wmf.24 refs T360156 [05:24:54] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [05:51:25] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:51:41] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.42.0-wmf.24 refs T360156 (duration: 26m 52s) [05:51:46] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [05:57:49] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T0600). 
[06:02:49] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:49] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:57] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:10:43] (VarnishUnavailable) firing: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:10:44] (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:11:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 11.1% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[06:12:15] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:13:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [06:13:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200: 84.5839037598653s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:13:21] enWS and enWP are down for me. [06:13:24] "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" [06:13:45] "Request from 141.0.239.148 via cp3066.esams.wmnet, ATS/9.1.4 [06:13:45] Error: 502, Broken pipe at 2024-03-26 06:12:22 GMT" [06:14:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad api_appserver GET/200: 0.41981656238412324s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatency [06:14:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:14:15] (MediaWikiLatencyExceeded) firing: (2) p75 latency high: eqiad mw-api-ext (k8s) 1.051s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:14:34] (ProbeDown) firing: (28) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:14:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:14:57] (ProbeDown) firing: (28) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:15:43] (VarnishUnavailable) resolved: (2) varnish-text has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:15:51] Ah, but now it's back up. [06:16:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 27.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:17:15] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:18:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [06:18:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200: 0.4505921831561039s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [06:19:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad api_appserver GET/200: ... 
[06:19:15] 0.20929432787262256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:19:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:19:20] (MediaWikiLatencyExceeded) resolved: (2) p75 latency high: eqiad mw-api-ext (k8s) 1.051s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:19:34] (ProbeDown) resolved: (28) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:19:40] <_joe_> xover: yeah things should be ok now though? 
[06:19:51] (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:20:04] _joe_: 👍🏻 [06:20:11] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:20:44] (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:24:34] (ProbeDown) firing: (30) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:29:34] (ProbeDown) firing: (30) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:30:11] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:34:34] (ProbeDown) resolved: (23) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:36:04] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:57] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:41:16] (03PS2) 10KartikMistry: Enable ContentTranslation by default for myvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013084 (https://phabricator.wikimedia.org/T353510) [06:42:19] (03PS2) 10KartikMistry: Update cxserver to 2024-03-21-114859-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013273 (https://phabricator.wikimedia.org/T353510) [06:42:49] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:43:11] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:11] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:51:25] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:46] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:22:57] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [07:57:41] (03PS1) 10Giuseppe Lavagetto: misc-frontend: also ban abusers from phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/1014420 [07:59:14] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1014420 (owner: 10Giuseppe Lavagetto) [07:59:23] (03CR) 10Giuseppe Lavagetto: [C:03+2] misc-frontend: also ban abusers from phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/1014420 (owner: 10Giuseppe Lavagetto) [08:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:25] * kart_ is here.. 
[08:02:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013084 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:03:11] (03Merged) 10jenkins-bot: Enable ContentTranslation by default for myvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013084 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:03:57] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1013084|Enable ContentTranslation by default for myvwiki (T353510)]] [08:04:02] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:06:33] !log kartik@deploy1002 kartik: Backport for [[gerrit:1013084|Enable ContentTranslation by default for myvwiki (T353510)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:08:43] !log kartik@deploy1002 kartik: Continuing with sync [08:18:45] !log deleting AQS codfw VIP (10.2.1.12/32) from Netbox - T358793 [08:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:50] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793 [08:19:32] !log deleting AQS eqiad VIP (10.2.2.12/32) from Netbox - T358793 [08:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:46] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1013084|Enable ContentTranslation by default for myvwiki (T353510)]] (duration: 15m 48s) [08:19:49] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:20:18] Config deployment is done, will do cxserver deployment (minor) [08:21:20] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-03-21-114859-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1013273 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:22:19] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-21-114859-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013273 (https://phabricator.wikimedia.org/T353510) (owner: 10KartikMistry) [08:23:50] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:24:12] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:24:58] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:24:58] (03CR) 10Brouberol: deployment_server: Add redis misc instances to external_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [08:25:13] (03CR) 10Brouberol: [C:03+1] "Let's gooo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014024 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [08:25:30] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:28:13] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:28:45] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:31:55] !log I'm going to apply kafka log compaction for {eqiad,codfw}.mediawiki.currussearch.page_rerender.v1 on kafka-main-codfw only (current replica) - T354794 [08:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:00] T354794: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 [08:33:03] (03PS3) 10Winston Sung: zhwikivoyage: Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013632 (https://phabricator.wikimedia.org/T360175) (owner: 10S8321414) [08:36:12] (03CR) 10Winston Sung: zhwikivoyage: Enable NewUserMessage extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1013632 (https://phabricator.wikimedia.org/T360175) (owner: 10S8321414) [08:38:13] May I request Gerrit change 1013632 be deployed?Thanks. [08:38:29] May I request Gerrit change 1013632 be deployed? Thanks. [08:39:29] Winston_Sung: sorry I missed the deployment window. I will do it [08:40:04] Appreciated. [08:41:03] (03CR) 10David Caro: [C:03+1] P:toolforge: drop grid shutdown from MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1014118 (owner: 10Majavah) [08:41:30] !log Updated cxserver to 2024-03-21-114859-production (T353510) [08:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [08:41:53] !log depooling and restarting blazegraph on wdqs1013 (stuck for 2 days) [08:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:20] inflatador, ryankemper: ^ [08:42:23] (03CR) 10Jelto: [C:03+1] "lgtm, the httpbb can be moved to the k8s miscweb yaml file as well. But that can happen in another change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [08:42:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013632 (https://phabricator.wikimedia.org/T360175) (owner: 10S8321414) [08:43:39] (03Merged) 10jenkins-bot: zhwikivoyage: Enable NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013632 (https://phabricator.wikimedia.org/T360175) (owner: 10S8321414) [08:43:57] (03PS1) 10Hashar: httpbb: raise timeout for Barack Obama [puppet] - 10https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) [08:44:08] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1013632|zhwikivoyage: Enable NewUserMessage extension (T360175)]] [08:44:12] T360175: Enable NewUserMessage extension for zhwikivoyage - https://phabricator.wikimedia.org/T360175 [08:46:08] oh there is a #chinese-sites project in Phabricator, that is great [08:46:38] !log hashar@deploy1002 hashar and s8321414: Backport for [[gerrit:1013632|zhwikivoyage: Enable NewUserMessage extension (T360175)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:46:58] Winston_Sung: it is on the debug servers if you know how to check that? :) [08:47:21] Yeah, I know. [08:47:31] (ProbeDown) resolved: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:46] oh firefox shows chinese characters in my url bar now! `returnto=首页` [08:48:02] But I'm not sure whether I had to create a new account to test it or not. 
[08:48:25] *need [08:48:26] the extension itself should work [08:48:35] (03PS1) 10Jcrespo: admin: Add GeorgeMikesell's production access [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) [08:48:35] it is merely to verify the special page exists / is enabled [08:48:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:49:30] (03CR) 10CI reject: [V:04-1] admin: Add GeorgeMikesell's production access [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) (owner: 10Jcrespo) [08:49:37] Ok. Let me check. It's 1002, right? [08:49:52] yeah [08:50:01] or mwdebug1001 [08:50:12] the patch is on both [08:50:39] (03CR) 10Joal: [C:03+1] Update the from address of all email from refinery jobs. [puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [08:51:22] Yeah, I can see it in Special:Version. [08:51:30] oh of course, I should have checked that [08:51:46] then there are a bunch of messages that would need to be tweaked https://www.mediawiki.org/wiki/Extension:NewUserMessage#In-wiki_configuration :) [08:51:50] !log hashar@deploy1002 hashar and s8321414: Continuing with sync [08:53:02] (03PS2) 10Jcrespo: admin: Add GeorgeMikesell's production access [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) [08:53:16] Looks good to me: [08:53:18] zh.wikivoyage.org/wiki/Template:Welcome [08:53:36] https://zh.wikivoyage.org/wiki/Template:Welcome [08:55:21] (03CR) 10Jcrespo: "@GeorgeMikesell, could you double check the patch? 
In particular, the key had a spelling error, as well as the login name." [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) (owner: 10Jcrespo) [09:00:00] Winston_Sung: great! [09:00:38] We could sync it to production. [09:01:15] Oh, browser cache. [09:01:19] Never mind. [09:02:37] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1013632|zhwikivoyage: Enable NewUserMessage extension (T360175)]] (duration: 18m 29s) [09:02:41] T360175: Enable NewUserMessage extension for zhwikivoyage - https://phabricator.wikimedia.org/T360175 [09:02:48] Winston_Sung: done!! :) [09:04:23] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-tool1010.eqiad.wmnet [09:04:34] Thanks. [09:05:38] @hashar: Much appreciated. [09:06:13] Winston_Sung: and thank you for poking the channel, or I would have missed it for sure :) [09:10:28] (03PS1) 10Brouberol: Decommission an-tool1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1014432 (https://phabricator.wikimedia.org/T353782) [09:11:28] (03PS2) 10Brouberol: Decommission an-tool1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1014432 (https://phabricator.wikimedia.org/T353782) [09:12:45] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [09:19:14] (03PS5) 10Urbanecm: Add CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) [09:19:34] jouncebot: nowandnext [09:19:34] No deployments scheduled for the next 1 hour(s) and 40 minute(s) [09:19:34] In 1 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1100) [09:20:10] (03PS1) 10Slyngshede: Add reminder that emails are public.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1014434 (https://phabricator.wikimedia.org/T360888) [09:20:20] (03CR) 10JMeybohm: [V:03+1] deployment_server: Add redis misc instances to external_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [09:22:33] (03PS11) 10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [09:24:42] (03PS1) 10Brouberol: external-services: ensure rendering idempotence by sorting services and IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014436 (https://phabricator.wikimedia.org/T331894) [09:26:03] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [09:26:36] (03PS1) 10Slyngshede: Fix documentation link. [software/bitu] - 10https://gerrit.wikimedia.org/r/1014437 (https://phabricator.wikimedia.org/T360635) [09:31:19] (03CR) 10JMeybohm: [C:03+1] external-services: ensure rendering idempotence by sorting services and IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014436 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:33:59] (03CR) 10Brouberol: [C:03+2] external-services: ensure rendering idempotence by sorting services and IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014436 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:36:11] (03PS12) 10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [09:36:18] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - 
https://phabricator.wikimedia.org/T345337#9660652 (10fnegri) [09:37:01] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:37:26] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:38:24] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:38:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:38:42] (03CR) 10Jelto: [C:03+2] gitlab_runner: unregister gitlab-runner2004 for dockerfile conversion [puppet] - 10https://gerrit.wikimedia.org/r/1014005 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [09:39:55] (03CR) 10Urbanecm: [C:03+2] Add CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [09:40:41] (03PS6) 10Urbanecm: Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) [09:40:43] (03Merged) 10jenkins-bot: Add CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013608 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [09:40:44] (03CR) 10Urbanecm: [C:03+2] Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [09:40:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [09:41:33] (03Merged) 10jenkins-bot: Add wmgUseCommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013609 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [09:42:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1013608|Add CommunityConfiguration extension (T357766)]], [[gerrit:1013609|Add 
wmgUseCommunityConfiguration (T357766)]] [09:42:07] T357766: Deploy Community configuration to beta wiki - https://phabricator.wikimedia.org/T357766 [09:42:14] (03PS1) 10TChin: Add datasets-config namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014439 (https://phabricator.wikimedia.org/T357434) [09:50:18] (03PS1) 10Brouberol: aqs: remove conftool data and envoy listener [puppet] - 10https://gerrit.wikimedia.org/r/1014441 [09:50:51] (03PS2) 10Brouberol: aqs: remove conftool data and envoy listener [puppet] - 10https://gerrit.wikimedia.org/r/1014441 (https://phabricator.wikimedia.org/T358793) [09:50:58] (03CR) 10Majavah: [C:03+2] P:toolforge: drop grid shutdown from MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1014118 (owner: 10Majavah) [09:51:12] (03CR) 10Majavah: [C:03+2] P:toolforge::legacy_redirector: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014026 (https://phabricator.wikimedia.org/T311909) (owner: 10Majavah) [09:52:02] (03PS13) 10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [09:54:54] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014052 [09:55:05] (03CR) 10JMeybohm: [C:03+1] aqs: remove conftool data and envoy listener [puppet] - 10https://gerrit.wikimedia.org/r/1014441 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [09:56:33] (03CR) 10Brouberol: [C:03+2] aqs: remove conftool data and envoy listener [puppet] - 10https://gerrit.wikimedia.org/r/1014441 (https://phabricator.wikimedia.org/T358793) (owner: 10Brouberol) [09:56:41] dcausse: re wdqs1013 – are you sure it’s depooled? 
wikidata maxlag has apparently skyrocketed (https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?viewPanel=12) and I don’t see any other lagged server in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8 [09:56:52] and from T238751 it sounds like depooled servers should not be factored into the maxlag calculation anymore [09:56:52] T238751: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 [09:56:54] Lucas_WMDE: looking [09:56:57] (though I’m not sure if that’s working properly or not) [09:56:58] thanks [09:57:19] * Lucas_WMDE dimly recalls that there were two depooling mechanisms and for maxlag we were querying one of them, maybe? [09:57:19] (03PS14) 10JMeybohm: deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) [09:57:46] jouncebot: nowandnext [09:57:46] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [09:57:46] In 1 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1100) [09:59:01] (03PS1) 10TChin: dse-k8s: add datasets-config namespace [puppet] - 10https://gerrit.wikimedia.org/r/1014443 (https://phabricator.wikimedia.org/T357434) [10:00:01] hm seeing 'eqiad/wdqs/wdqs/wdqs1013.eqiad.wmnet: pooled changed yes => no' so it was properly depooled but indeed https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=41 suggests that it's still pooled ... 
[10:00:38] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1721/co" [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:00:40] apparently maxlag was not affected while the server was “stuck for 2 days” [10:00:57] it feels like there might be two sources of truth for whether a server is pooled or not [10:01:12] (03CR) 10JMeybohm: [V:03+1] deployment_server: Add redis misc instances to external_services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:01:13] and the automatic depooling of the lagged server (is that a thing now?) was taken into account for maxlag [10:01:17] but the manual one for the restart wasn’t? [10:01:20] idk [10:01:21] the prometheus query uses the query rate as a proxy for the pooling status [10:01:27] hm [10:01:31] that sounds like it should work [10:01:44] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [10:01:51] perhaps something is still hitting this machine, looking [10:02:02] (maybe the “two sources of truth” I’m remembering is *before* you came up with that prometheus query) [10:02:17] (03CR) 10JMeybohm: [C:03+2] Enable external-services on all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014024 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [10:02:18] brouberol: you might be able to help (and gain some understanding of WDQS in the process) ^ [10:03:43] Lucas_WMDE: I'm going to investigate a bit if I can't fix the issue I'll simply stop blazegraph and we'll recover from another machine [10:04:03] alright, thanks a lot! 
[10:04:39] thanks for the ping, btw, I should have been more careful [10:05:06] (03CR) 10Btullis: [C:03+1] "Cool." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014010 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [10:05:23] (03Merged) 10jenkins-bot: Enable external-services on all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014024 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [10:05:51] (03CR) 10Slyngshede: [C:03+2] Fix documentation link. [software/bitu] - 10https://gerrit.wikimedia.org/r/1014437 (https://phabricator.wikimedia.org/T360635) (owner: 10Slyngshede) [10:08:12] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [10:08:12] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:08:13] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-tool1010.eqiad.wmnet [10:09:28] (03PS11) 10Gmodena: Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:09:43] (03CR) 10Gmodena: [C:03+1] Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:10:00] (03CR) 10Gmodena: [C:03+1] Add webrequest.frontend.rc0 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:12:26] (03CR) 10Brouberol: [C:03+1] deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:13:28] (03CR) 10JMeybohm: [V:03+1 
C:03+2] deployment_server: Add redis misc instances to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1013971 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:13:35] seems like the threshold is too low... rate > 1 is giving false positives, rate > 1.3 is better, seems fragile... [10:13:42] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1013608|Add CommunityConfiguration extension (T357766)]], [[gerrit:1013609|Add wmgUseCommunityConfiguration (T357766)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:13:48] T357766: Deploy Community configuration to beta wiki - https://phabricator.wikimedia.org/T357766 [10:13:48] !log urbanecm@deploy1002 urbanecm: Continuing with sync [10:15:32] (03PS5) 10Urbanecm: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) [10:15:37] (03PS5) 10Urbanecm: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) [10:16:58] (03CR) 10Urbanecm: [C:03+2] [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:17:42] (03Merged) 10jenkins-bot: [beta] eswiki: Enable CommunityConfiguration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013610 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:18:03] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:18:12] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
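Note on the exchange above (the query rate used as a proxy for pooling status, and `rate > 1` giving false positives where `rate > 1.3` does not): a minimal Python sketch of why such a cutoff is fragile. This is only an illustration under stated assumptions, not the actual Prometheus query or the real maxlag code. The host names `wdqs-a` and `wdqs-b`, every query rate, and every lag value are made up; the 172800 s figure simply stands in for the "stuck for 2 days" host.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    query_rate: float   # queries per second hitting the host (hypothetical)
    lag_seconds: float  # update lag of the host (hypothetical)


def pooled_by_rate(servers, threshold):
    """Treat a server as pooled when its query rate exceeds the threshold."""
    return [s for s in servers if s.query_rate > threshold]


def max_lag(servers, threshold):
    """Maximum lag across the servers the rate heuristic considers pooled."""
    pooled = pooled_by_rate(servers, threshold)
    return max((s.lag_seconds for s in pooled), default=0.0)


servers = [
    Server("wdqs-a", query_rate=25.0, lag_seconds=30.0),
    Server("wdqs-b", query_rate=18.0, lag_seconds=45.0),
    # Depooled host that still receives a trickle of monitoring queries
    # (assumed rate of 1.2/s, assumed lag of two days).
    Server("wdqs1013", query_rate=1.2, lag_seconds=172800.0),
]

# With the lower cutoff the depooled host still counts as pooled and
# dominates the result; with the higher cutoff it is excluded.
print(max_lag(servers, threshold=1.0))
print(max_lag(servers, threshold=1.3))
```

In this made-up data a shift of 0.3 queries per second in background traffic is enough to move the lagged host in or out of the calculation, which is the fragility described in the channel.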
[10:18:39] !log stopping blazegraph on wdqs1013, (wdqs->wikidata maxlag propagation not working as expected) [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:48] (03PS1) 10Brouberol: external-services: let /health requests get responded by Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014467 (https://phabricator.wikimedia.org/T356484) [10:22:18] Lucas_WMDE: maxlag should be ok, filing a task, problem is the threshold on the query rate that is too fragile, we probably have some monitoring queries that add too much noise on this metric [10:22:51] 06SRE, 06Infrastructure-Foundations, 10MediaWiki-Email, 10observability: Consolidation and tracking of automated email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9660755 (10jcrespo) [10:23:31] 06SRE, 06serviceops, 07Epic: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9660756 (10jcrespo) [10:23:59] 06SRE, 06Infrastructure-Foundations, 10Maps: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9660757 (10jcrespo) [10:24:07] (03CR) 10Btullis: [C:03+1] "nit: Commit title states external-services instead of superset." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014467 (https://phabricator.wikimedia.org/T356484) (owner: 10Brouberol) [10:24:33] (03CR) 10Btullis: [V:03+1 C:03+2] Update the from address of all email from refinery jobs. [puppet] - 10https://gerrit.wikimedia.org/r/1014001 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [10:24:39] (03PS2) 10Brouberol: superset: let /health requests get responded by Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014467 (https://phabricator.wikimedia.org/T356484) [10:24:50] (03CR) 10Brouberol: "oops, fixed, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1014467 (https://phabricator.wikimedia.org/T356484) (owner: 10Brouberol) [10:26:07] (03CR) 10Brouberol: [C:03+2] spark-history: bypass Kerberos principal hostname reverse DNS check for namenode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014010 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [10:26:22] (03CR) 10Brouberol: [C:03+2] superset: let /health requests get responded by Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014467 (https://phabricator.wikimedia.org/T356484) (owner: 10Brouberol) [10:26:39] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1014103 (owner: 10EoghanGaffney) [10:28:14] (03PS1) 10JMeybohm: deployment_server: Fix structure for redis misc external services [puppet] - 10https://gerrit.wikimedia.org/r/1014474 (https://phabricator.wikimedia.org/T360612) [10:28:56] (03PS1) 10Slyngshede: Clear system field on deactivation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1014475 (https://phabricator.wikimedia.org/T359533) [10:28:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:29:25] (SystemdUnitFailed) firing: wdqs-updater.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:39] expected ^ [10:30:52] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [10:31:01] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1013608|Add 
CommunityConfiguration extension (T357766)]], [[gerrit:1013609|Add wmgUseCommunityConfiguration (T357766)]] (duration: 48m 57s) [10:31:06] T357766: Deploy Community configuration to beta wiki - https://phabricator.wikimedia.org/T357766 [10:31:18] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9660775 (10jcrespo) I am going to remove the #SRE-Access-Requests because, while it is indeed an access request, it is not immediately actionable by people on clinic duty, but has to be discus... [10:31:20] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [10:31:42] (03CR) 10Urbanecm: [C:03+2] [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:31:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:31:57] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1722/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014474 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:32:06] (03CR) 10Btullis: [C:03+2] dse-k8s: add datasets-config namespace [puppet] - 10https://gerrit.wikimedia.org/r/1014443 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [10:32:36] (03Merged) 10jenkins-bot: [beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013611 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:32:59] (03CR) 10Btullis: [C:03+2] Add datasets-config namespace to dse-k8s [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1014439 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [10:33:04] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [10:33:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1013610|[beta] eswiki: Enable CommunityConfiguration extension (T357766)]], [[gerrit:1013611|[beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments (T357766)]] [10:33:32] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [10:33:40] (03CR) 10Jcrespo: "Note: May require NDA or WMF LDAP group addition before merging, as well as kerberos addition afterwards." [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) (owner: 10Jcrespo) [10:34:39] (03CR) 10Brouberol: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1014474 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:35:09] (03CR) 10JMeybohm: [V:03+1 C:03+2] deployment_server: Fix structure for redis misc external services [puppet] - 10https://gerrit.wikimedia.org/r/1014474 (https://phabricator.wikimedia.org/T360612) (owner: 10JMeybohm) [10:35:43] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1013610|[beta] eswiki: Enable CommunityConfiguration extension (T357766)]], [[gerrit:1013611|[beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments (T357766)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:35:48] !log urbanecm@deploy1002 urbanecm: Continuing with sync [10:35:58] (03CR) 10Slyngshede: [C:03+2] Clear system field on deactivation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1014475 (https://phabricator.wikimedia.org/T359533) (owner: 10Slyngshede) [10:36:17] (03Merged) 10jenkins-bot: Add datasets-config namespace to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014439 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [10:37:37] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [10:38:04] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [10:39:04] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:40:51] (03PS1) 10Urbanecm: [beta] eswiki: Fix CommunityConfiguration config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014480 (https://phabricator.wikimedia.org/T357766) [10:41:12] (03CR) 10Urbanecm: [C:03+2] [beta] eswiki: Fix CommunityConfiguration config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014480 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:41:57] (03Merged) 10jenkins-bot: [beta] eswiki: Fix CommunityConfiguration config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014480 (https://phabricator.wikimedia.org/T357766) (owner: 10Urbanecm) [10:43:01] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:46:46] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1013610|[beta] eswiki: Enable CommunityConfiguration extension (T357766)]], [[gerrit:1013611|[beta] eswiki: Use CommunityConfiguration extension for GrowthExperiments (T357766)]] (duration: 13m 41s) [10:46:50] T357766: Deploy Community configuration to beta wiki - https://phabricator.wikimedia.org/T357766 [10:52:30] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:55] (03PS1) 10Driedmueller: Dont recalculate winners from scratch each round [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) [10:58:25] (03CR) 10Hashar: Merge tag 'v3.8.4' into wmf/stable-3.8 (031 comment) [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [10:59:25] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1100) [11:00:05] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:02:42] (03PS1) 10Driedmueller: Dont recalculate winners from scratch each round [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) [11:04:14] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [11:04:45] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9660931 (10jcrespo) @Reedy what did you see as slow back then? Right now doing: * https://lists.wikimedia.org/postorius/lists/?page=49 seems rela... [11:06:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 16 hosts with reason: Maint T343718 [11:06:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:06:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 16 hosts with reason: Maint T343718 [11:09:36] (03PS1) 10Filippo Giunchedi: snmp: instruct libsmi with snmp-mibs-downloader path [puppet] - 10https://gerrit.wikimedia.org/r/1014483 (https://phabricator.wikimedia.org/T359198) [11:10:02] (03CR) 10CI reject: [V:04-1] snmp: instruct libsmi with snmp-mibs-downloader path [puppet] - 10https://gerrit.wikimedia.org/r/1014483 (https://phabricator.wikimedia.org/T359198) (owner: 10Filippo Giunchedi) [11:10:35] (03PS2) 10Filippo Giunchedi: snmp: instruct libsmi with snmp-mibs-downloader path [puppet] - 10https://gerrit.wikimedia.org/r/1014483 (https://phabricator.wikimedia.org/T359198) [11:10:53] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [11:11:18] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [11:11:35] (03Merged) 10jenkins-bot: Merge
tag 'v3.8.4' into wmf/stable-3.8 [software/gerrit] (wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1013953 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [11:12:20] (03PS1) 10Jelto: gitlab_runner: allow dockerfile frontend on gitlab-runner2004 [puppet] - 10https://gerrit.wikimedia.org/r/1014485 (https://phabricator.wikimedia.org/T357612) [11:13:55] 06SRE, 10ChangeProp, 06collaboration-services, 10GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9660965 (10larissagaulia) [11:14:52] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1723/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014485 (https://phabricator.wikimedia.org/T357612) (owner: 10Jelto) [11:15:18] !log Stopping puppet on P:restbase to deploy 1005756 - T358213 [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:22] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [11:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:47] (03CR) 10Clément Goubert: [C:03+2] restbase: Start moving mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [11:19:06] (03PS2) 10Cparle: MachineVision extension is sunsetted, stop refining events [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) [11:19:13] !log enabling and running puppet on restbase2021.codfw.wmnet - T358213 [11:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:23:53] dcausse: okay, thanks a lot! [11:24:05] !log enabling and running puppet on restbase1035.eqiad.wmnet - T358213 [11:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:08] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [11:27:09] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014054 [11:31:08] Looks ok [11:33:04] (03PS1) 10Hashar: Gerrit 3.8.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) [11:33:21] !log enabling and running puppet on P:restbase - T358213 [11:33:22] (03CR) 10CI reject: [V:04-1] Gerrit 3.8.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [11:35:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:35:51] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:36:22] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:36:24] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:54:55] (03PS1) 10Btullis: Update the from_address for burrow notification emails [puppet] - 10https://gerrit.wikimedia.org/r/1014491 (https://phabricator.wikimedia.org/T358675) [11:56:31] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1724/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014491 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [11:56:40] (03CR) 10Btullis: Update the from_address for burrow notification emails [puppet] - 10https://gerrit.wikimedia.org/r/1014491 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [11:56:46] (03CR) 10Hashar: "recheck after pushing LFS objects" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1200) [12:00:55] (03PS2) 10Hashar: Gerrit 3.8.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) [12:05:00] (03PS1) 10Clément Goubert: restbase: Migrate backend traffic to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1014493 (https://phabricator.wikimedia.org/T358213) [12:05:59] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:06:20] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:06:40] (03CR) 10Clément Goubert: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1725/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014493 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [12:09:51] (03CR) 10Btullis: [C:03+1] "The code looks good to me in principle, so I'm happy to add my +1." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) (owner: 10Sg912) [12:26:37] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/1014483 (https://phabricator.wikimedia.org/T359198) (owner: 10Filippo Giunchedi) [12:30:09] jouncebot: nowandnext [12:30:09] For the next 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1200) [12:30:09] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1300) [12:41:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1014113|[officewiki, testwiki]: enable CodeMirrorV6 (T357795)]] [12:41:40] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [12:46:09] !log samtar@deploy1002 musikanimal and samtar: Backport for [[gerrit:1014113|[officewiki, testwiki]: enable CodeMirrorV6 (T357795)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:46:23] * TheresNoTime testing [12:47:39] !log samtar@deploy1002 musikanimal and samtar: Continuing with sync [12:48:53] !log noting that `host='mwdebug2001.codfw.wmnet', port=443): Read timed out.` during scap `check_testservers_baremetal`, retry worked P58919 [12:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:54] TheresNoTime: known issue https://phabricator.wikimedia.org/T360867 I need to get on merging hashar's timeout change [12:51:01] claime: ack, thank you [12:52:09] I also don't know that has never hit us before [12:52:21] I won’t be able to deploy during the backport window btw [12:52:34] Lucas_WMDE: s'ok, I can :) [12:52:40] or maybe that is related to the datacenter switch over when the parsing happens from codfw [12:52:41] hashar: The integration of httpbb in scap is recent, and we don't usually run the httpbb tests on debug servers [12:52:41] \o/ [12:52:53] It has nothing to do with the switchover :) [12:54:15] !log enabling and running puppet on restbase2021.codfw.wmnet - T358213 [12:54:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:19] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [12:54:21] !log enabling and running puppet on restbase1035.eqiad.wmnet - T358213 [12:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:07] unrelated, is there a reason we don't have a hostname alias for the currently active `mwmaint` ? (like we do for the deployment server I mean) [12:56:31] TheresNoTime: we do? [12:56:59] we do?? [12:57:11] TheresNoTime: it seems like it https://www.irccloud.com/pastebin/RO3FVKlq/ [12:57:56] oops :D [12:58:12] :o TIL [12:58:16] there is no `.codfw.wmnet` alias as far as i can see. i have no idea how this one behaves when we're switched over [12:59:09] huh, I didn’t know deployment.codfw.wmnet existed [12:59:09] nop [12:59:16] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1014113|[officewiki, testwiki]: enable CodeMirrorV6 (T357795)]] (duration: 17m 40s) [12:59:20] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [12:59:25] I think I was told to always use deployment.eqiad.wmnet, and I know for the past half year it referred to the codfw one [12:59:25] the canonical one is `deployment.eqiad.wmnet` [12:59:59] we did ask for canonical entries without the datacenter but that got declined, I can't remember the reason [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1300) [13:00:05] phuedx, Kizule, anzx, and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] so I would expect mwmaint to also point to whichever DC is active [13:00:18] * TheresNoTime can deploy [13:00:18] o/ [13:00:23] s/mwmaint/maintenance/ [13:00:27] o/ [13:00:32] phuedx: starting with yours [13:00:39] But you can do in your ~/.ssh/config: [13:00:39] Host deployment.wmnet [13:00:39] Hostname deployment.eqiad.wmnet [13:00:50] heh, I guess that works yeah [13:00:56] TheresNoTime: ack [13:01:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:01:12] (PS I think wikibugs has given up) [13:02:18] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1013234|Update mediawiki.web_ui_actions stream config (T360955)]] [13:02:22] T360955: Update mediawiki.web_ui_actions Stream Config - https://phabricator.wikimedia.org/T360955 [13:02:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:04:48] !log samtar@deploy1002 phuedx and samtar: Backport for [[gerrit:1013234|Update mediawiki.web_ui_actions stream config (T360955)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:04:59] phuedx: is this testable? [13:05:06] TheresNoTime: Yes. Which server? [13:05:34] phuedx: any :) [13:05:37] Sorry. Daft question :D [13:06:02] always better to check! :D [13:06:14] * Kizule is waving [13:07:00] o/ [13:09:16] TheresNoTime: Confirmed. Tested on 2001. Saw that the stream config is updated in the browser and that events were still sending correctly [13:09:23] ack [13:09:28] !log samtar@deploy1002 phuedx and samtar: Continuing with sync [13:14:22] scap is being slow, but Kizule your patch will be next FYI [13:14:57] Okay, thanks for letting me know. [13:15:22] Do I actually need to be here, as it's IP throttling related?\ [13:16:09] Kizule: nah not really, I'll just sync it :) [13:17:12] I'm asking because I'm waiting for a courier (something like FedEx, but another one that works in Serbia only) and he can call any minute that he's coming and that I have to come outside to pick a shipment. 
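The ~/.ssh/config alias suggested at 13:00:39 is flattened onto one line in the log. A minimal sketch of how it would be laid out, assuming `deployment.wmnet` is only a local shorthand chosen by the speaker (not a DNS name) and that `deployment.eqiad.wmnet` is the canonical entry, as stated in the channel:

```
# ~/.ssh/config
# Local alias only: "deployment.wmnet" is matched by ssh itself, not resolved via DNS.
Host deployment.wmnet
    Hostname deployment.eqiad.wmnet
```

With such an entry, `ssh deployment.wmnet` would connect to whichever host `deployment.eqiad.wmnet` currently points at.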
[13:17:18] I'll just wait here, thanks. [13:18:21] (03PS3) 10Samtar: [officewiki, testwiki]: enable CodeMirrorV6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [13:18:22] (03PS1) 10Hnowlan: wmnet: remove similar-users [dns] - 10https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274) [13:18:33] oh hi wikibugs [13:18:50] (03PS1) 10Giuseppe Lavagetto: services_proxy: add support for split listeners [puppet] - 10https://gerrit.wikimedia.org/r/1014497 [13:18:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [13:19:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:19:15] (03Merged) 10jenkins-bot: [officewiki, testwiki]: enable CodeMirrorV6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014113 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [13:19:19] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1726/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014497 (owner: 10Giuseppe Lavagetto) [13:19:47] (03CR) 10Clément Goubert: [C:03+1] services_proxy: add support for split listeners [puppet] - 10https://gerrit.wikimedia.org/r/1014497 (owner: 10Giuseppe Lavagetto) [13:19:51] (03PS1) 10Hnowlan: service: set similar-users to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014499 (https://phabricator.wikimedia.org/T345274) [13:20:07] (03PS6) 10Amire80: planet: add various feeds, reorganize [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [13:20:11] (03PS1) 10Hnowlan: service: remove similar-users from realserver, set service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014500 
(https://phabricator.wikimedia.org/T345274) [13:20:22] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1013234|Update mediawiki.web_ui_actions stream config (T360955)]] (duration: 18m 03s) [13:20:23] (03CR) 10Clément Goubert: [C:03+2] services_proxy: add support for split listeners [puppet] - 10https://gerrit.wikimedia.org/r/1014497 (owner: 10Giuseppe Lavagetto) [13:20:26] T360955: Update mediawiki.web_ui_actions Stream Config - https://phabricator.wikimedia.org/T360955 [13:20:42] phuedx: live in prod :) [13:20:51] (03PS4) 10Samtar: Update mediawiki.web_ui_actions stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T360955) (owner: 10Phuedx) [13:20:58] TheresNoTime: Thanks! :) [13:21:05] (03CR) 10Hashar: [C:03+2] Gerrit 3.8.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:21:10] grr [13:21:11] (03Merged) 10jenkins-bot: Gerrit 3.8.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1014488 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [13:21:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T360955) (owner: 10Phuedx) [13:21:29] (03Merged) 10jenkins-bot: Update mediawiki.web_ui_actions stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013234 (https://phabricator.wikimedia.org/T360955) (owner: 10Phuedx) [13:21:34] getting pinged by wikibugs in every channel? 
:p [13:21:39] *as it catches up [13:21:41] no [13:21:44] the delay kills me [13:21:45] (03PS4) 10Samtar: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) (owner: 10Zoranzoki21) [13:22:22] I am just hoping that someone is annoyed by the newly added latency, files a task about it and whoever maintains it ends up fixing the underlying issue :) [13:22:35] but I guess I will have to file it myself [13:22:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1014073|Add throttle rule for editathon (T360533)]] [13:22:50] T360533: Lift IP cap on 2024-04-06 for Editathon for eswiki and commonswiki - https://phabricator.wikimedia.org/T360533 [13:22:50] (03CR) 10Volans: [C:03+1] "LGTM, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014106 (owner: 10Majavah) [13:23:10] (03CR) 10Hnowlan: [C:03+1] restbase: Migrate backend traffic to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1014493 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [13:23:18] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#9661372 (10Volans) @BTullis indeed, that's another new device type created with the wrong slug. I've updated the slug in Netbox to fix it. [13:23:23] E_TOO_MANY_TASKS [13:23:26] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9661373 (10DBu-WMF) Hey @Dzahn this ticket number does not come up in search and when I add the ticket number to the url I get this message: Access Denied: Unknown Object (Task) This object is in a... 
[13:23:34] anzx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1012115 is conflicted btw [13:23:46] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9661385 (10RobH) [13:24:02] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9661386 (10RobH) Remote work task is via CS1553796, remote hands has confirmed receipt of the SSDs and work to take place on March 27th @ 11AM CET. [13:24:30] (03PS6) 10Anzx: knwikisource, knwiktionary: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022) [13:24:43] TheresNoTime: checking [13:24:47] ty [13:25:16] !log samtar@deploy1002 samtar and zoranzoki21: Backport for [[gerrit:1014073|Add throttle rule for editathon (T360533)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:25:22] !log samtar@deploy1002 samtar and zoranzoki21: Continuing with sync [13:25:34] Kizule: sync'd, no need to test ^ :) [13:25:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) (owner: 10Zoranzoki21) [13:26:09] (03Merged) 10jenkins-bot: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014073 (https://phabricator.wikimedia.org/T360533) (owner: 10Zoranzoki21) [13:26:21] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dbprov2005 dns - pt1979@cumin2002" [13:26:23] (03CR) 10Majavah: [C:03+2] k8s: Remove use of @staticmethod in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014106 (owner: 10Majavah) [13:26:43] (03PS1) 10JMeybohm: external-services: remove the service name from port names [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1014505 (https://phabricator.wikimedia.org/T331894) [13:27:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dbprov2005 dns - pt1979@cumin2002" [13:27:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:29] (03Merged) 10jenkins-bot: k8s: Remove use of @staticmethod in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014106 (owner: 10Majavah) [13:30:49] hashar: the "lag" is caused by the fact that the bot is still having some connectivity issues to the redis queue, so when I restart the bot it will start sending older messages from the queue. bd808 is working on making it not get disconnected [13:31:32] so the restart is done manually to reconnect to Redis [13:31:40] and that causes the bugs to flush the queued messages [13:31:42] correct? [13:31:46] (03CR) 10Filippo Giunchedi: [C:03+2] snmp: instruct libsmi with snmp-mibs-downloader path [puppet] - 10https://gerrit.wikimedia.org/r/1014483 (https://phabricator.wikimedia.org/T359198) (owner: 10Filippo Giunchedi) [13:32:10] it will start processing the queue again after a restart, yes [13:33:01] (03PS3) 10Anzx: dewiki: Enable mobile page tabs for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012115 (https://phabricator.wikimedia.org/T360246) [13:35:59] (03CR) 10JMeybohm: [C:03+2] external-services: remove the service name from port names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014505 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [13:36:26] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1014073|Add throttle rule for editathon (T360533)]] (duration: 13m 40s) [13:36:30] T360533: Lift IP cap on 2024-04-06 for Editathon for eswiki and commonswiki - https://phabricator.wikimedia.org/T360533 [13:36:31] (03PS7) 10Samtar: knwikisource, knwiktionary: update logo, wordmark 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022) (owner: 10Anzx) [13:36:58] anzx: starting with 1010881 [13:37:04] Ok [13:37:35] (03CR) 10Gmodena: [C:03+2] eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) (owner: 10Gmodena) [13:37:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022) (owner: 10Anzx) [13:38:35] (03Merged) 10jenkins-bot: knwikisource, knwiktionary: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010881 (https://phabricator.wikimedia.org/T360022) (owner: 10Anzx) [13:38:50] (03Merged) 10jenkins-bot: external-services: remove the service name from port names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014505 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [13:38:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:39:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:39:08] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1010881|knwikisource, knwiktionary: update logo, wordmark (T360022)]] [13:39:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:39:14] T360022: Update logo for Kannada Wikisource and Wiktionary - https://phabricator.wikimedia.org/T360022 [13:39:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:39:33] !log ladsgroup@cumin1002 dbctl 
commit (dc=all): 'Depooling db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58921 and previous config saved to /var/cache/conftool/dbconfig/20240326-133932-ladsgroup.json [13:39:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:41:02] (03CR) 10Gmodena: [V:03+2 C:03+2] eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) (owner: 10Gmodena) [13:41:06] (03PS1) 10Anzx: dewiki: Enable mobile page tabs for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014509 (https://phabricator.wikimedia.org/T360246) [13:41:14] (03Merged) 10jenkins-bot: eventstreams: change default num_workers to 0. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008446 (https://phabricator.wikimedia.org/T359051) (owner: 10Gmodena) [13:41:17] Thanks TheresNoTime! [13:41:26] np! [13:41:36] !log samtar@deploy1002 anzx and samtar: Backport for [[gerrit:1010881|knwikisource, knwiktionary: update logo, wordmark (T360022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:49] anzx: on mwdebug [13:41:50] TheresNoTime: testing [13:43:16] (03CR) 10FNegri: [C:03+2] [wmcs-backup] Fix parsing of exclude_volumes [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) (owner: 10FNegri) [13:43:21] TheresNoTime: looks good [13:43:29] !log samtar@deploy1002 anzx and samtar: Continuing with sync [13:43:40] (KubernetesRsyslogDown) firing: rsyslog on mw1483:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1483 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:44:52] (03PS1) 10Majavah: openstack: wmcs-enc-cli: Add get_prefix_roles [puppet] - 10https://gerrit.wikimedia.org/r/1014511 [13:45:01] (03PS9) 10Klausman: admin_ng: Add network 
policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:45:51] (03PS13) 10Samtar: frwiki: update legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011097 (https://phabricator.wikimedia.org/T359741) (owner: 10Anzx) [13:45:58] (03CR) 10Andrew Bogott: [C:03+1] openstack: wmcs-enc-cli: Add get_prefix_roles [puppet] - 10https://gerrit.wikimedia.org/r/1014511 (owner: 10Majavah) [13:47:46] (03PS10) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:47:51] jouncebot: nowandnext [13:47:51] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1300) [13:47:52] In 1 hour(s) and 12 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1500) [13:48:08] backport window is likely to overrun, any issues? 
[13:48:09] (03CR) 10Majavah: [C:03+2] openstack: wmcs-enc-cli: Add get_prefix_roles [puppet] - 10https://gerrit.wikimedia.org/r/1014511 (owner: 10Majavah) [13:48:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1483:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1483 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:50:27] (03PS11) 10Klausman: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) [13:50:52] (03CR) 10Elukey: "I am not able to reproduce, if I build locally it works just fine:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [13:51:21] (03CR) 10Klausman: "I've removed the extra sections, so this now only has service entry and network policy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:53:41] (03CR) 10Elukey: [C:03+1] "Fine to proceed with testing, but we should spend some time in figuring out how to integrate our configs with the new services/calico-net-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:54:42] 06SRE, 10netops, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): 14Icinga BFD check failing - 14https://phabricator.wikimedia.org/T359198#9661583 (10fgiunchedi) 05Open→03Resolved 14This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-... 
[13:55:02] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1010881|knwikisource, knwiktionary: update logo, wordmark (T360022)]] (duration: 15m 53s) [13:55:06] T360022: Update logo for Kannada Wikisource and Wiktionary - https://phabricator.wikimedia.org/T360022 [13:55:21] anzx: ^ live — going to run 1014509 and 1011097 together, does that sound okay? [13:55:35] TheresNoTime: ok [13:55:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014509 (https://phabricator.wikimedia.org/T360246) (owner: 10Anzx) [13:55:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011097 (https://phabricator.wikimedia.org/T359741) (owner: 10Anzx) [13:56:14] (03PS1) 10Ssingh: depool esams for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1014514 (https://phabricator.wikimedia.org/T360430) [13:56:48] (03Merged) 10jenkins-bot: dewiki: Enable mobile page tabs for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014509 (https://phabricator.wikimedia.org/T360246) (owner: 10Anzx) [13:57:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:57:27] (03Merged) 10jenkins-bot: frwiki: update legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1011097 (https://phabricator.wikimedia.org/T359741) (owner: 10Anzx) [13:57:58] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1014509|dewiki: Enable mobile page tabs for everyone (T360246)]], [[gerrit:1011097|frwiki: update legacy vector logo (T359741)]] [13:58:03] T360246: Enable talk for 
mobile anon users on dewiki - https://phabricator.wikimedia.org/T360246 [13:58:04] T359741: Update frwiki PNG logo assets - https://phabricator.wikimedia.org/T359741 [14:00:39] !log samtar@deploy1002 anzx and samtar: Backport for [[gerrit:1014509|dewiki: Enable mobile page tabs for everyone (T360246)]], [[gerrit:1011097|frwiki: update legacy vector logo (T359741)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:42] TheresNoTime: checking [14:00:44] ty [14:00:52] (03Abandoned) 10Andrea Denisse: icinga: Set explicit BASEDIR for MIB database in snmp-mibs-downloader [puppet] - 10https://gerrit.wikimedia.org/r/1008941 (https://phabricator.wikimedia.org/T359198) (owner: 10Andrea Denisse) [14:01:01] (03CR) 10Volans: "The approach looks sane to me, I've left some minor comments inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [14:01:04] 10ops-codfw, 06SRE: 14Inbound interface errors - 14https://phabricator.wikimedia.org/T360972#9661644 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:01:24] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts parse1014.eqiad.wmnet [14:01:58] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:02:30] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:02:49] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:03:01] just fyi y'all, scap `Started docker pull on k8s nodes` had 1 error, `Error response from daemon: Get "https://docker-registry.discovery.wmnet/v2/": EOF`, is carrying on though — https://phabricator.wikimedia.org/P58923 [14:03:25] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:04:05] TheresNoTime: looks good [14:04:10] ack [14:04:14] !log samtar@deploy1002 anzx and samtar: Continuing with sync [14:10:03] (03PS1) 10JMeybohm: Add external-services namespace to all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014518 (https://phabricator.wikimedia.org/T331894) [14:12:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:12:25] (SystemdUnitFailed) firing: (6) podman-auto-update.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:38] !log vriley@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts parse1014.eqiad.wmnet [14:15:21] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1014509|dewiki: Enable mobile page tabs for everyone (T360246)]], [[gerrit:1011097|frwiki: update legacy vector logo (T359741)]] (duration: 17m 23s) [14:15:27] T360246: Enable talk for mobile anon users on dewiki - https://phabricator.wikimedia.org/T360246 [14:15:27] T359741: Update frwiki PNG logo assets - https://phabricator.wikimedia.org/T359741 [14:15:34] anzx: live on prod, please check [14:15:50] TheresNoTime: checking [14:16:16] (03CR) 10JMeybohm: [C:03+2] Add external-services namespace to all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014518 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [14:17:25] (SystemdUnitFailed) firing: (9) podman-auto-update.service on moss-be1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:07] TheresNoTime: all new logo change appears for me [14:18:13] ack :) [14:18:19] TheresNoTime: thank you [14:18:36] np [14:18:50] !log UTC afternoon backport window done [14:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:19] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:19:34] (03Merged) 10jenkins-bot: Add external-services namespace to all wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014518 (https://phabricator.wikimedia.org/T331894) (owner: 10JMeybohm) [14:19:42] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:19:45] !log enabling and running puppet on P:restbase - T358213 [14:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:49] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [14:20:31] (03PS1) 10Majavah: cloudnfs: Add missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1014523 [14:20:51] !log Deploying split listener for 10% of backend restbase traffic to mw-api-int - T358213 [14:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:59] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:21:00] (03CR) 10CI reject: [V:04-1] cloudnfs: Add missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1014523 (owner: 10Majavah) [14:21:48] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:22:30] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:24:10] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:24:32] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[14:25:18] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:25:40] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:26:03] 10ops-eqiad, 06SRE: PowerSupplyFailure - https://phabricator.wikimedia.org/T360722#9661721 (10VRiley-WMF) After inspection, was unable to see which power supply was causing this issue. No indication while logged into the unit, and no LED's indicating such failure. Ran a firmware update for iDrac and this has r... [14:26:13] 10ops-eqiad, 06SRE: 14PowerSupplyFailure - 14https://phabricator.wikimedia.org/T360722#9661722 (10VRiley-WMF) 05Open→03Resolved [14:26:18] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:27:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:32:04] (03PS1) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [14:32:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.65% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:37:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.65% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:37:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:19] 
(03PS7) 10Herron: SLO queries for AQS 2.0 geo analytics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) (owner: 10Sg912) [14:39:45] (03PS2) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [14:40:55] (03PS3) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [14:41:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9661806 (10Ladsgroup) We are also considering implementing {T360589} to allow for improved storage and caching which would in turn enable SREs to change t... [14:43:15] (03PS2) 10Majavah: cloudnfs: Add missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1014523 [14:44:34] (03CR) 10Kosta Harlan: "IMO we should consider T359335 alongside this work, to minimize overrides in mediawiki-config where possible" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [14:49:31] (03CR) 10Herron: "Thanks for bumping this! 
Looking again I realize we needed to update the metric name in the availability queries to sort that error out, " [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) (owner: 10Sg912) [14:52:45] (ProbeDown) firing: (2) Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:53:07] (03CR) 10Elukey: "Please also keep in mind that we are experiencing this problem:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009472 (https://phabricator.wikimedia.org/T358751) (owner: 10Sg912) [14:54:22] jouncebot: nowandnext [14:54:23] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [14:54:23] In 0 hour(s) and 5 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1500) [14:55:17] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9661855 (10Clement_Goubert) [14:55:40] (KubernetesRsyslogDown) firing: rsyslog on mw1483:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1483 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:55:49] !log jnuche@deploy1002 Installing scap version "4.73.2" for 364 hosts [14:56:43] !log jnuche@deploy1002 Installation of scap version "4.73.2" completed for 364 hosts [14:57:02] jayme: if you want to check out a node with its syslog failing ^ [14:57:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:28] claime: ack, going to take a look [14:57:48] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q4): Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#9661867 (10fgiunchedi) [14:57:49] 07sre-alert-triage, 10SRE Observability (FY2023/2024-Q4): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9661869 (10fgiunchedi) [14:57:50] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4), 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788#9661868 (10fgiunchedi) [14:58:17] 06SRE, 10Cloud-VPS, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#9661872 (10fgiunchedi) [14:59:25] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9661885 (10Clement_Goubert) 10% of `RESTbase`'s backend `mwapi` requests are now made to `mw-api-int` {F43443512} [14:59:40] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:05] eoghan, jelto, and arnoldokoth: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1500) [15:02:47] claime: that's a new one ...impstats: error reading /proc/2876433/fd : Too many open files [15:03:11] jayme: definitely a new one [15:08:46] claime: it seems to keep fd's for all the deleted container logs... 
[15:09:11] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-coord1001.eqiad.wmnet [15:09:15] # ls /proc/2876433/fd/ -la | grep '(deleted)' -c [15:09:16] 16243 [15:13:05] (03PS4) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [15:15:33] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [15:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:07] (03PS1) 10MVernon: ceph: add udev to container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1014532 [15:18:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [15:19:14] (03PS5) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [15:19:37] (03CR) 10Dreamy Jazz: [C:04-2] "Must wait until Ifa5a0123cd915bdb7c87e473c51fb93321622f12 is deployed on all wikis before deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [15:19:45] (03CR) 10Elukey: "To keep archives happy, Python in Blubber images looks for modules in:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:19:58] (03CR) 10Dreamy Jazz: [C:04-2] "Created Ifa5a0123cd915bdb7c87e473c51fb93321622f12 for that task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [15:20:13] hashar: I agree that the flakiness of wikibugs in the last month or so has been frustrating. My most recent attempt at making it more robust was switching all of the Redis connections to use the Python redis.asyncio library code with socket keepalive and connection failure retry turned on. [15:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:21:03] bd808: yeah I have been wondering whether it was just me having latency and whether anyone else noticed :) [15:21:13] The even more frustrating result of this has been that the testing deployment has been running for more than 4.5 days with no issues, but the main deployment continues to get confused about what it is doing once a day. [15:21:16] my rant is why nobody else filed a task about it, but I did not either hehe [15:21:29] oh [15:22:11] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [15:22:19] hashar: t.aavi has been doing the restarts around EU midday I think, so you probably see the queue draining more than lots of other folks. 
[15:22:34] yeah that is what he told me [15:23:16] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [15:23:16] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:16] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-coord1001.eqiad.wmnet [15:23:33] and there is of course a fairly large set of reasons for the connection to end up stalled :/ [15:25:02] yeah, there are a couple of different layers of software defined networking, different exec nodes in Toolforge, and potential resource contention and noisy neighbors in shared spaces that all add variability. [15:25:17] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-coord1002.eqiad.wmnet [15:26:06] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:26:14] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9661985 (10WDoranWMF) Approved from me! [15:26:18] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:26:28] I'm heading quickly towards pulling Redis out of the project entirely to be replaced by what feels like a simpler webservice based work queue for the producers and consumer to talk to. 
[15:26:43] KAAAAFFFFKKKAAAA [15:27:07] or there is gearman which can have the queue backed up in sqlite or something like that [15:27:10] (KubernetesRsyslogDown) resolved: rsyslog on mw1483:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1483 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:27:17] but that is an obsolete proto nowadays [15:27:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [15:27:37] (03CR) 10Ladsgroup: [C:03+1] ceph: add udev to container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1014532 (owner: 10MVernon) [15:27:40] (03CR) 10David Caro: [C:03+1] cloudceph::osd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013521 (owner: 10Muehlenhoff) [15:27:47] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:27:49] (03CR) 10David Caro: [C:03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1013521 (owner: 10Muehlenhoff) [15:27:56] (03CR) 10Klausman: [C:03+1] "SGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:28:03] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:28:05] (03PS1) 10Brouberol: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) [15:28:45] (03CR) 10CI reject: [V:04-1] Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:29:15] I actually thought about kafka, but from what I can tell that is more suited for a persistent event stream than for simple queued event passing between processes. 
[15:29:40] (03PS1) 10Elukey: role::docker_registry_ha::registry: set nginx's tmpfs size in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) [15:30:20] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] [15:30:24] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [15:30:35] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [15:31:05] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:31:06] (03CR) 10MVernon: [V:03+2 C:03+2] ceph: add udev to container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1014532 (owner: 10MVernon) [15:31:38] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [15:32:17] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:32:19] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:32:22] bd808: o/ in theory kafka is suited to do queue events between distributed processes, even if they are not in a continuous stream [15:32:47] I mean it may be an option (didn't follow the whole conversation, just the last bits) [15:32:57] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [15:33:33] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:33:36] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:34:37] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [15:35:08] !log gmodena@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [15:37:25] (SystemdUnitFailed) firing: (9) podman-auto-update.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:33] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [15:39:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [15:39:25] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [15:39:25] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:26] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-coord1002.eqiad.wmnet [15:39:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9662008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [15:39:58] elukey: :nod: If you build me a robust multi-tenant Kafka cluster in Toolforge I'll try to learn to use it. ;) Probably a bit heavy tech to hand off json blobs between 3-4 python processes for an IRC notification bot though. 
[15:40:12] (03CR) 10Clément Goubert: [C:03+1] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [15:41:35] (03PS1) 10Cwhite: beta-logs: disable loki output [puppet] - 10https://gerrit.wikimedia.org/r/1014056 [15:42:25] (SystemdUnitFailed) firing: (9) podman-auto-update.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:26] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] (duration: 13m 05s) [15:43:30] (03CR) 10Btullis: "Nice! I spotted a few outdated comments mentioning an-coord100[1-2] but that's nitpicking." [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:43:35] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] [15:44:27] bd808: ahahah okok I was in for a chat not a big project, nevermind :D [15:45:00] (03PS2) 10Brouberol: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) [15:45:13] (03CR) 10Btullis: Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:45:31] Emperor: moss-be hosts have a bunch of podman failed units, PTAL [15:46:01] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] (duration: 02m 26s) [15:46:35] (03CR) 10Brouberol: Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 
10Brouberol) [15:47:21] (03PS3) 10Brouberol: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) [15:51:36] (03CR) 10CI reject: [V:04-1] Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:51:48] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:51:56] (03CR) 10Marco Fossati: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [15:52:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58926 and previous config saved to /var/cache/conftool/dbconfig/20240326-155227-arnaudb.json [15:52:50] (03PS1) 10JMeybohm: changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) [15:53:00] (03CR) 10Cwhite: [C:03+2] beta-logs: disable loki output [puppet] - 10https://gerrit.wikimedia.org/r/1014056 (owner: 10Cwhite) [15:54:01] (03CR) 10CI reject: [V:04-1] changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:54:31] elukey: you are too smart to fall for my clumsy nerd snipe attempt! 
:) [15:55:22] (03PS2) 10JMeybohm: changeprop: Add base.external-services-networkpolicy:1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014539 (https://phabricator.wikimedia.org/T359423) [15:55:22] (03PS2) 10JMeybohm: changeprop: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014540 (https://phabricator.wikimedia.org/T359423) [15:55:23] (03PS2) 10JMeybohm: changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) [15:55:50] bd808: I learned the hard way after too many nerd snipes :D [15:57:06] (03PS1) 10Majavah: cinderutils: Don't format volumes without explicit permission [puppet] - 10https://gerrit.wikimedia.org/r/1014543 [15:57:42] (03CR) 10Brouberol: [C:03+1] "I'm going to trust you on this one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014538 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:57:54] (03CR) 10CI reject: [V:04-1] cinderutils: Don't format volumes without explicit permission [puppet] - 10https://gerrit.wikimedia.org/r/1014543 (owner: 10Majavah) [15:57:57] (03CR) 10Brouberol: [C:03+1] changeprop: Add base.external-services-networkpolicy:1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014539 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:58:00] (03CR) 10Andrew Bogott: [C:03+1] cinderutils: Don't format volumes without explicit permission [puppet] - 10https://gerrit.wikimedia.org/r/1014543 (owner: 10Majavah) [15:58:10] (03CR) 10JMeybohm: "lol" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014538 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:58:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9662102 (10jcrespo) [15:58:42] (03CR) 10Brouberol: [C:03+1] "That must have felt nice to delete 
these copy-pasted CIDRs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014540 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:58:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9662107 (10jcrespo) a:03jcrespo [15:59:02] (03PS2) 10Majavah: cinderutils: Don't format volumes without explicit permission [puppet] - 10https://gerrit.wikimedia.org/r/1014543 [15:59:05] (03CR) 10Brouberol: [C:03+1] changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [16:00:05] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1600) [16:00:05] cormacparle: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:16] * cormacparle waves [16:00:19] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] [16:00:36] (03CR) 10Majavah: [C:03+2] cinderutils: Don't format volumes without explicit permission [puppet] - 10https://gerrit.wikimedia.org/r/1014543 (owner: 10Majavah) [16:01:38] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] (duration: 01m 18s) [16:02:57] rzl: ... don't really know what to expect here, tell me if I need to do something [16:03:25] 06SRE, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9662130 (10Dzahn) @DBu-WMF Sorry, I tried. Then it's further restricted than just NDA-level due to Security. Please contact @Jgreen, @jhathaway or the [[ https://security.wikimedia.org/ | security team... 
[16:04:40] (03PS1) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014545 (https://phabricator.wikimedia.org/T360423) [16:05:40] cormacparle: hi, sorry I'm just late :) [16:05:57] cormacparle: I can go ahead and merge the maintenance patch, but ideally I'd like to have a review from a dumps expert and a DE expert for the other two [16:06:37] aha! ok cool ... any suggestions who? [16:07:05] (DE is data engineering, right?) [16:07:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58927 and previous config saved to /var/cache/conftool/dbconfig/20240326-160733-arnaudb.json [16:08:24] I'll try and find someone ... [16:09:10] !log restart pybal on lvs2013 [16:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:04] (03PS1) 10Catrope: CodexHTMLForm: Fix margins around links in login form [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014456 (https://phabricator.wikimedia.org/T360945) [16:12:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [16:12:37] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) (owner: 10Jcrespo) [16:12:50] cormacparle: data engineering yeah, for the eventlogging change -- I can merge it, I just don't know enough to be confident that it's correct and I don't know if anything else needs to get done when I do :) [16:13:01] same goes for the dumps change [16:13:18] kk cool ... could you give me a little while to try and find someone? [16:13:56] yeah of course, no hurry on my end [16:14:08] drop me a PM if you like, even outside the window [16:14:46] meantime I'll go ahead and merge the maintenance patch at least -- assuming they don't all need to happen at the same time? 
[16:16:30] it's fine if they don't happen at the same time [16:16:40] cool, thanks Reuven [16:17:01] thank you! sorry for the only partial success :) [16:17:12] (03CR) 10RLazarus: [C:03+2] MachineVision is being sunsetted, so remove job [puppet] - 10https://gerrit.wikimedia.org/r/1013329 (https://phabricator.wikimedia.org/T352884) (owner: 10Cparle) [16:18:00] !log Importing karma 0.119 to reprepro - T333615 [16:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:10] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [16:18:46] godog: sorry, yes, they're in development/hacking not production. [16:19:34] ack [16:22:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58928 and previous config saved to /var/cache/conftool/dbconfig/20240326-162238-arnaudb.json [16:24:01] (03PS1) 10Brouberol: hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 [16:25:28] (03PS2) 10Brouberol: hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 (https://phabricator.wikimedia.org/T341895) [16:25:51] (03PS1) 10Brouberol: ats: drop mapping rule redirecting to hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014550 (https://phabricator.wikimedia.org/T341895) [16:26:02] (03CR) 10Clément Goubert: [C:03+1] service: remove similar-users from realserver, set service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014500 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [16:27:25] (SystemdUnitFailed) resolved: (6) podman-auto-update.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:06] (03CR) 10Clément Goubert: [C:03+1] wmnet: remove similar-users [dns] - 
10https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan) [16:30:21] (03PS1) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [16:31:43] (03PS1) 10Cwhite: beta-logs: add ssd-0[123] host configs [puppet] - 10https://gerrit.wikimedia.org/r/1014057 (https://phabricator.wikimedia.org/T353912) [16:32:27] (03PS1) 10Ssingh: P:pybal: fix regex for peer addresses for lvs2013 and 14 [puppet] - 10https://gerrit.wikimedia.org/r/1014552 [16:33:16] (03CR) 10Jgiannelos: [C:03+1] mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [16:33:39] (03CR) 10Cwhite: [C:03+2] beta-logs: add ssd-0[123] host configs [puppet] - 10https://gerrit.wikimedia.org/r/1014057 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [16:33:39] (03CR) 10CI reject: [V:04-1] wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [16:34:08] (03CR) 10Jgiannelos: [C:03+1] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [16:34:42] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1728/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014552 (owner: 10Ssingh) [16:35:01] (03PS2) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [16:36:04] (03CR) 10Ssingh: [V:03+1 C:03+2] P:pybal: fix regex for peer addresses for lvs2013 and 14 [puppet] - 10https://gerrit.wikimedia.org/r/1014552 (owner: 10Ssingh) [16:37:17] (03PS1) 10Brouberol: cache: remove caching config for 
hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014553 (https://phabricator.wikimedia.org/T341895) [16:37:19] (03PS1) 10Brouberol: cumin: remove hue alias [puppet] - 10https://gerrit.wikimedia.org/r/1014554 (https://phabricator.wikimedia.org/T341895) [16:37:20] (03PS1) 10Brouberol: site: change an-tool1009 role back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1014555 (https://phabricator.wikimedia.org/T341895) [16:37:22] (03PS1) 10Brouberol: idp: drop hue client configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014556 (https://phabricator.wikimedia.org/T341895) [16:37:23] (03PS1) 10Brouberol: aqs: remove manifests and configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) [16:37:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58929 and previous config saved to /var/cache/conftool/dbconfig/20240326-163744-arnaudb.json [16:37:49] 06SRE, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9662359 (10andrea.denisse) 05Open→03In progress [16:38:27] (03PS3) 10Jcrespo: admin: Add GeorgeMikesell's production access [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) [16:38:27] (03PS1) 10Jcrespo: admin: Add Mike Pham (mttp) to analytics-privatedata-users (keyless) [puppet] - 10https://gerrit.wikimedia.org/r/1014558 (https://phabricator.wikimedia.org/T360641) [16:39:06] !log restart pybal on lvs2013 and lvs2014 [16:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:55] (03PS1) 10Tchanders: Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) [16:40:01] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [16:40:15] (03CR) 10CI reject: [V:04-1] admin: Add Mike Pham (mttp) to analytics-privatedata-users (keyless) [puppet] - 10https://gerrit.wikimedia.org/r/1014558 (https://phabricator.wikimedia.org/T360641) (owner: 10Jcrespo) [16:40:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9662383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [16:42:58] (03PS1) 10Majavah: O:wmcs::toolforge::docker::registry: remove apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/1014560 [16:43:34] (03CR) 10Jcrespo: "This already exists?" [puppet] - 10https://gerrit.wikimedia.org/r/1014558 (https://phabricator.wikimedia.org/T360641) (owner: 10Jcrespo) [16:43:34] (03CR) 10Kosta Harlan: [C:03+1] Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [16:45:56] (03CR) 10Majavah: [C:03+2] O:wmcs::toolforge::docker::registry: remove apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/1014560 (owner: 10Majavah) [16:47:33] (03CR) 10Dreamy Jazz: "Somewhat duplicates If974ba8d09c235faeb033ab107fdb246d5877644, but this is just a subset of that patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [16:48:00] (03PS2) 10Dreamy Jazz: Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [16:48:12] (03CR) 10SBassett: [C:03+1] "LGTM and I think we're ready to test an enforcing CSP on a project (testwiki). We'll need an SRE to get this deployed though." 
[puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [16:48:19] (03CR) 10Dreamy Jazz: [C:03+1] Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [16:49:54] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] [16:50:08] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290]: Regular analytics weekly train [analytics/refinery@07a0290a] (duration: 00m 14s) [16:50:54] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290] (thin): Regular analytics weekly train THIN [analytics/refinery@07a0290a] [16:50:58] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290] (thin): Regular analytics weekly train THIN [analytics/refinery@07a0290a] (duration: 00m 04s) [16:51:19] !log jebe@deploy1002 Started deploy [analytics/refinery@07a0290] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@07a0290a] [16:54:20] (03CR) 10Jcrespo: [C:04-2] "Already deployed, 4 years ago." 
[puppet] - 10https://gerrit.wikimedia.org/r/1014558 (https://phabricator.wikimedia.org/T360641) (owner: 10Jcrespo) [16:54:57] !log jebe@deploy1002 Finished deploy [analytics/refinery@07a0290] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@07a0290a] (duration: 03m 38s) [16:55:04] (03Abandoned) 10Jcrespo: admin: Add Mike Pham (mttp) to analytics-privatedata-users (keyless) [puppet] - 10https://gerrit.wikimedia.org/r/1014558 (https://phabricator.wikimedia.org/T360641) (owner: 10Jcrespo) [16:56:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9662474 (10jcrespo) Apologies, but this access was already provided back in 2020 at T270438 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/650298), a... [16:56:31] (03PS1) 10Ssingh: P:pybal: add a check to ensure Pybal service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/1014563 [16:56:59] (03CR) 10CI reject: [V:04-1] P:pybal: add a check to ensure Pybal service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/1014563 (owner: 10Ssingh) [16:57:35] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1729/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014563 (owner: 10Ssingh) [16:57:56] (03PS2) 10Ssingh: P:pybal: add a check to ensure Pybal service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/1014563 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1700) [17:00:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 29.29% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:00:30] (03CR) 10BBlack: [C:03+1] P:pybal: add a check to ensure Pybal service has been restarted 
[puppet] - 10https://gerrit.wikimedia.org/r/1014563 (owner: 10Ssingh) [17:05:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 29.29% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:06:50] !log add georgemikesell to wmf ldap group T358922 [17:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:05] T358922: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922 [17:07:33] (03CR) 10Jcrespo: [C:03+2] admin: Add GeorgeMikesell's production access [puppet] - 10https://gerrit.wikimedia.org/r/1014426 (https://phabricator.wikimedia.org/T358922) (owner: 10Jcrespo) [17:09:49] (03CR) 10Mforns: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:10:02] (03PS11) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [17:10:13] (03PS3) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [17:11:14] (03PS1) 10RLazarus: maintenance: Absent the MachineVision_prioritize_uncategorized job [puppet] - 10https://gerrit.wikimedia.org/r/1014565 (https://phabricator.wikimedia.org/T352884) [17:11:35] (03CR) 10Hnowlan: [C:03+2] mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:12:45] (03Merged) 10jenkins-bot: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [17:13:22] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1730/co" [puppet] - 10https://gerrit.wikimedia.org/r/1014565 (https://phabricator.wikimedia.org/T352884) (owner: 10RLazarus) [17:13:37] (03PS2) 10RLazarus: maintenance: Absent the MachineVision_prioritize_uncategorized job [puppet] - 10https://gerrit.wikimedia.org/r/1014565 (https://phabricator.wikimedia.org/T352884) [17:13:51] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [17:14:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9662550 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [17:17:03] (03CR) 10RLazarus: [C:03+2] maintenance: Absent the MachineVision_prioritize_uncategorized job [puppet] - 10https://gerrit.wikimedia.org/r/1014565 (https://phabricator.wikimedia.org/T352884) (owner: 10RLazarus) [17:17:21] (03CR) 10CI reject: [V:04-1] alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [17:17:38] (03CR) 10Mforns: MachineVision extension is sunsetted, stop refining events [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:21:23] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1006'] [17:21:48] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbprov1006'] [17:22:08] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1006'] [17:22:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade 
firmware for hosts ['dbprov1006'] [17:22:25] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1006'] [17:22:47] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbprov1006'] [17:23:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for GeorgeMikesell - 14https://phabricator.wikimedia.org/T358922#9662581 (10jcrespo) 05Open→03Resolved 14@GMikesell-WMF (or @cchen on his behalf)- access has been merged, it may take ~30 minutes to b... [17:24:01] (03PS3) 10Hnowlan: mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) [17:24:26] (03CR) 10Ssingh: [C:03+2] P:pybal: add a check to ensure Pybal service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/1014563 (owner: 10Ssingh) [17:28:46] (03PS1) 10Ssingh: P:pybal: install pystemd package for file monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014570 [17:30:13] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014058 [17:31:08] (03PS4) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [17:32:17] (03PS2) 10Ssingh: P:pybal: install pystemd package for file monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014570 [17:32:36] (03PS5) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [17:33:03] (03CR) 10Majavah: "Thanks for the review!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [17:34:05] (03PS1) 10Bartosz Dziewoński: topicsubscriptions.js: No longer assume both buttons and links exist in DOM [extensions/DiscussionTools] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014457 (https://phabricator.wikimedia.org/T360942) [17:34:56] (03PS1) 10Fabfur: hiera: second nvme disk to text_esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) [17:35:03] (03PS3) 10Cparle: MachineVision extension is sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) [17:35:39] (03CR) 10Ssingh: [C:03+2] P:pybal: install pystemd package for file monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1014570 (owner: 10Ssingh) [17:37:20] (03PS4) 10Cparle: MachineVision extension is sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) [17:38:16] (03PS6) 10Santiago Faci: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) [17:38:36] (03PS2) 10Fabfur: hiera: second nvme disk to text_esams [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) [17:40:03] (03CR) 10Ssingh: [C:03+1] "[as you mention but so that we remember: merge AFTER per-host overrides]" [puppet] - 10https://gerrit.wikimedia.org/r/1014571 (https://phabricator.wikimedia.org/T360430) (owner: 10Fabfur) [17:40:34] (03CR) 10CI reject: [V:04-1] alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [17:40:56] (03PS6) 10Jcrespo: mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 
(https://phabricator.wikimedia.org/T334069) [17:40:56] (03PS6) 10Jcrespo: mediabackups: Add newly setup storage host backup2011 [puppet] - 10https://gerrit.wikimedia.org/r/995189 (https://phabricator.wikimedia.org/T334069) [17:40:56] (03PS1) 10Jcrespo: mariadb: Disable notifications for db2100 (long hw issues) [puppet] - 10https://gerrit.wikimedia.org/r/1014573 (https://phabricator.wikimedia.org/T361037) [17:41:08] (03PS2) 10Jcrespo: mariadb: Disable notifications for db2100 (long hw issues) [puppet] - 10https://gerrit.wikimedia.org/r/1014573 (https://phabricator.wikimedia.org/T361037) [17:41:59] (03PS6) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [17:42:18] (03CR) 10Jcrespo: [C:03+2] mariadb: Disable notifications for db2100 (long hw issues) [puppet] - 10https://gerrit.wikimedia.org/r/1014573 (https://phabricator.wikimedia.org/T361037) (owner: 10Jcrespo) [17:42:46] (03PS3) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [17:43:12] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:44:56] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:45:16] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:46:16] (03CR) 10Joal: "One nit please" [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:49:18] (03CR) 10CI reject: [V:04-1] alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [17:49:19] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DBA, and 2 others: db2100 crashed - https://phabricator.wikimedia.org/T361037#9662658 (10jcrespo) DC Ops, the host crashed and 3 memory banks are mapped out. Can you evaluate the host and either ask for in warranty replacements or any other alte... [17:49:48] (03PS5) 10Cparle: MachineVision extension is sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) [17:50:00] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DBA, and 2 others: db2100 crashed (memory error) - https://phabricator.wikimedia.org/T361037#9662663 (10jcrespo) [17:50:12] (03CR) 10Cparle: MachineVision extension is sunsetted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:52:13] (03CR) 10Joal: [C:03+1] "LGTM! 
Thanks for pinging us" [puppet] - 10https://gerrit.wikimedia.org/r/1013519 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [17:52:31] (03PS1) 10RLazarus: maintenance: Re-remove absented MachineVision_prioritize_uncategorized job [puppet] - 10https://gerrit.wikimedia.org/r/1014576 (https://phabricator.wikimedia.org/T352884) [17:53:23] (03CR) 10Santiago Faci: "Changing the stream name according to what we have decided" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [17:55:39] (03CR) 10RLazarus: [C:03+2] maintenance: Re-remove absented MachineVision_prioritize_uncategorized job [puppet] - 10https://gerrit.wikimedia.org/r/1014576 (https://phabricator.wikimedia.org/T352884) (owner: 10RLazarus) [17:56:13] (03PS7) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [17:57:58] (03CR) 10DCausse: "pcc output: https://puppet-compiler.wmflabs.org/output/1014551/1734/" [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [18:00:05] jeena and dancy: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T1800). [18:00:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2005.codfw.wmnet with OS bullseye [18:00:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9662695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye executed w... [18:08:54] Hi, can anyone merge this? 
[18:08:54] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1010938 [18:13:59] Aram46: please schedule for a backport & config window like https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T2000 [18:15:22] Aram46: ^ +1 the process for these types of requests is outlined on https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Lifecycle_of_a_request and adding to a backport window is a step in that process [18:16:30] (03CR) 10Hnowlan: [C:03+2] mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [18:16:54] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014579 (https://phabricator.wikimedia.org/T360156) [18:16:55] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014579 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [18:17:28] (03Merged) 10jenkins-bot: mobileapps: add cassandra config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [18:19:07] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014579 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [18:20:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:20:35] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:21:26] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:21:56] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:30:42] (03PS1) 10DCausse: updateQueryServiceLag: tune the min query rate on a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 
(https://phabricator.wikimedia.org/T360993) [18:34:24] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.24 refs T360156 [18:34:29] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [18:36:43] !log denisse@cumin2002 START - Cookbook sre.puppet.migrate-host for host alert2001.wikimedia.org [18:36:45] !log denisse@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host alert2001.wikimedia.org [18:47:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [18:47:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9662838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... [18:47:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [18:48:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9662839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [18:48:48] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: 14Connect two hosts in codfw row A/B for switch migration testing - 14https://phabricator.wikimedia.org/T345803#9662842 (10Papaul) 05Open→03Resolved [18:51:34] (03CR) 10RLazarus: [C:04-1] "(Puppet window reviewer here! 
Generic review on Puppet semantics only -- Ariel's team can comment on dumps.)" [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [18:59:34] (03Restored) 10Andrea Denisse: alert: Ensure the alert1001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [18:59:40] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:43] (03Restored) 10Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [19:02:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [19:05:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [19:07:13] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9662920 (10Papaul) dbprov2005 was failing after installing the OS many times. After troubleshooting, when the server reboots into the OS after the OS instal... 
[19:08:03] (03PS4) 10Andrea Denisse: alert: Update hiera entries for alert2001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) [19:08:52] (03PS1) 10CDobbins: icinga: add cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1014589 [19:10:10] (03PS2) 10Andrea Denisse: alert: Update hiera entries for alert1001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) [19:10:10] (03PS1) 10BBlack: Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 [19:11:46] (03CR) 10Andrea Denisse: "Hello everyone, following the guidelines of the sre.puppet.migrate-host cookbook we need to merge this change before proceeding with the P" [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [19:11:59] (03CR) 10Andrea Denisse: "Hello everyone, following the guidelines of the sre.puppet.migrate-host cookbook we need to merge this change before proceeding with the P" [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [19:12:16] (03CR) 10Andrea Denisse: "Hello everyone, following the guidelines of the sre.puppet.migrate-host cookbook we need to merge this change before proceeding with the P" [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [19:13:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046 (10BBlack) 03NEW [19:14:09] (03PS2) 10BBlack: Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) [19:14:51] (03CR) 10Ssingh: Add myself to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014590 
(https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [19:15:09] (03PS2) 10Ebernhardson: cirrus: Check backfill status prior to reindexing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008895 [19:15:09] (03PS1) 10Ebernhardson: cirrus: More reliable reporting of reindexing status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 [19:17:15] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9663008 (10BBlack) [19:18:51] (03PS3) 10BBlack: Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) [19:19:00] (03CR) 10BBlack: Add myself to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [19:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:20:45] (03CR) 10Ssingh: [C:03+1] Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [19:20:48] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:20:58] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:21:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:21:48] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:21:55] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:20] (03PS5) 10Andrea Denisse: alert: Update hiera entries for alert2001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T358506) [19:26:16] (03PS3) 10Andrea Denisse: alert: Update hiera entries for alert1001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T358506) [19:33:52] (03PS2) 10Ebernhardson: cirrus: More reliable reporting of reindexing status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014593 [19:34:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:34:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:37:47] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9663166 (10KOfori) Approved. [19:41:16] (03PS1) 10Dzahn: create sysop-pl.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) [19:41:28] (03PS2) 10Dzahn: create sysop-pl.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) [19:43:03] (03CR) 10Ladsgroup: [C:04-1] "we don't create misc wikis under wikipedia.org TLD. For many reasons." [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [19:45:14] (03CR) 10Dzahn: "I also wanted to suggest wikimedia.org first but see the line right above it, there already is one just like this." 
[dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [19:46:03] (03CR) 10Dzahn: "https://sysop-it.wikipedia.org/wiki/Pagina_principale" [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [19:46:15] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [19:46:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9663217 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [19:46:56] (03PS1) 10Ryan Kemper: elastic: replace some masters [puppet] - 10https://gerrit.wikimedia.org/r/1014600 (https://phabricator.wikimedia.org/T358882) [19:48:44] !log phabricator - added GMikesell-WMF to WMF-NDA because that goes together with the wmf LDAP group (https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group) - T358922 [19:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:48] T358922: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922 [19:49:24] (03PS2) 10Ryan Kemper: elastic: replace some masters [puppet] - 10https://gerrit.wikimedia.org/r/1014600 (https://phabricator.wikimedia.org/T358882) [19:49:31] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1014600 (https://phabricator.wikimedia.org/T358882) (owner: 10Ryan Kemper) [19:49:43] (03CR) 10Urbanecm: "Many such wikis exist (arbcom-*.wikipedia.org is a common pattern). 
It also feels like a natural way to indicate which project family the " [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [19:53:42] (03CR) 10Bking: [C:03+1] elastic: replace some masters [puppet] - 10https://gerrit.wikimedia.org/r/1014600 (https://phabricator.wikimedia.org/T358882) (owner: 10Ryan Kemper) [19:53:54] (03CR) 10Ryan Kemper: [C:03+2] elastic: replace some masters [puppet] - 10https://gerrit.wikimedia.org/r/1014600 (https://phabricator.wikimedia.org/T358882) (owner: 10Ryan Kemper) [19:55:41] (03CR) 10Ladsgroup: [C:04-1] "It was done as a mistake, many wikis are made with historical notes and ideas that change by now." [dns] - 10https://gerrit.wikimedia.org/r/1014598 (https://phabricator.wikimedia.org/T361041) (owner: 10Dzahn) [19:58:09] (03CR) 10Ahmon Dancy: [C:03+1] httpbb: raise timeout for Barack Obama [puppet] - 10https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) (owner: 10Hashar) [19:58:21] (03CR) 10Thcipriani: [C:03+1] httpbb: raise timeout for Barack Obama [puppet] - 10https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) (owner: 10Hashar) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240326T2000). [20:00:05] RoanKattouw and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] I can self-service [20:00:35] hi [20:00:49] And I can do yours too [20:01:32] ty [20:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014456 (https://phabricator.wikimedia.org/T360945) (owner: 10Catrope) [20:01:48] (03CR) 10Catrope: [C:03+2] topicsubscriptions.js: No longer assume both buttons and links exist in DOM [extensions/DiscussionTools] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014457 (https://phabricator.wikimedia.org/T360942) (owner: 10Bartosz Dziewoński) [20:02:04] Also +2ing yours manually so that the CI run can happen in parallel [20:02:12] (03CR) 10David Martin: [C:03+1] Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [20:02:20] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9663304 (10VRiley-WMF) a:03VRiley-WMF [20:05:23] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9663329 (10VRiley-WMF) @herron As it turns out, we currently don't have spare memory at 32Gig DDR4 3200. However, we have plenty of 32Gig DDR4 2666. Would this be an acceptabl... 
[20:09:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:09:20] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882 [20:09:28] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [20:13:34] (03PS1) 10Dzahn: miscweb: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014605 (https://phabricator.wikimedia.org/T360413) [20:14:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:35] (03PS1) 10Dzahn: vrts: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014606 (https://phabricator.wikimedia.org/T360413) [20:20:31] (03PS2) 10Dzahn: vrts: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014606 (https://phabricator.wikimedia.org/T360413) [20:20:38] (03Merged) 10jenkins-bot: CodexHTMLForm: Fix margins around links in login form [core] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014456 (https://phabricator.wikimedia.org/T360945) (owner: 10Catrope) [20:20:42] (03Merged) 10jenkins-bot: topicsubscriptions.js: No longer assume both buttons and links exist in DOM [extensions/DiscussionTools] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014457 
(https://phabricator.wikimedia.org/T360942) (owner: 10Bartosz Dziewoński) [20:21:11] !log catrope@deploy1002 Started scap: Backport for [[gerrit:1014456|CodexHTMLForm: Fix margins around links in login form (T360945)]] [20:21:15] T360945: styles: remove spacing on cdx-field links - https://phabricator.wikimedia.org/T360945 [20:22:12] (03PS3) 10Dzahn: vrts: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014606 (https://phabricator.wikimedia.org/T360413) [20:23:37] (03PS1) 10Dzahn: ssl: delete ticket.discovery.wmnet cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014607 (https://phabricator.wikimedia.org/T360413) [20:27:01] !log catrope@deploy1002 catrope: Backport for [[gerrit:1014456|CodexHTMLForm: Fix margins around links in login form (T360945)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:06] T360945: styles: remove spacing on cdx-field links - https://phabricator.wikimedia.org/T360945 [20:27:27] MatmaRex: Yours accidentally hitched a ride to the test servers as well, so please test your patch [20:27:43] looking [20:29:25] (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service on elastic2109:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:26] RoanKattouw: looks good [20:30:36] Great! Mine looks good too, proceeding [20:30:41] !log catrope@deploy1002 catrope: Continuing with sync [20:30:51] (tested at https://test.wikipedia.org/wiki/Talk:2024-03-26_test) [20:33:57] (03CR) 10Dzahn: "take a look at what it actually does: https://puppet-compiler.wmflabs.org/output/1013649/1736/stewards1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [20:33:57] What about mine? This is my very first one and don't know what to do next. 
[20:34:45] The first ones aren't finished yet [20:35:24] Okay [20:36:03] Aram46: Apologies, I didn't see yours because it was added so recently. I'll do yours after mine and MatmaRex's patches finish syncing, that'll probably take another 10 minutes or so [20:37:30] RoanKattouw: It is okay, take your time. [20:38:12] (03CR) 10Dzahn: "I don't think we have a ticket for this, but also see https://phabricator.wikimedia.org/rOPUPf7418873e143a05e9518649c4d2b6afc34cababb" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [20:40:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maint T352010 [20:40:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:41:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maint T352010 [20:41:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [20:41:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [20:43:21] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1014456|CodexHTMLForm: Fix margins around links in login form (T360945)]] (duration: 22m 09s) [20:43:26] T360945: styles: remove spacing on cdx-field links - https://phabricator.wikimedia.org/T360945 [20:43:46] (03PS4) 10Catrope: Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010938 (https://phabricator.wikimedia.org/T360228) (owner: 10Aram) [20:43:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010938 (https://phabricator.wikimedia.org/T360228) (owner: 10Aram) [20:44:50] (03PS1) 10Dzahn: doc: include 
::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014610 [20:45:08] (03Merged) 10jenkins-bot: Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010938 (https://phabricator.wikimedia.org/T360228) (owner: 10Aram) [20:45:40] !log catrope@deploy1002 Started scap: Backport for [[gerrit:1010938|Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary (T360228)]] [20:45:44] T360228: Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary - https://phabricator.wikimedia.org/T360228 [20:46:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: Maint T343718 [20:46:13] (03PS1) 10Dzahn: releases: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014611 [20:46:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:46:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: Maint T343718 [20:47:39] (03PS1) 10Dzahn: peopleweb: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014612 [20:49:02] !log catrope@deploy1002 aram and catrope: Backport for [[gerrit:1010938|Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary (T360228)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:50:13] Aram46 I'm checking your change now, just going to check that Special:Listgrouprights displays the right things. If you have the WikimediaDebug browser extension installed you can check this yourself too. 
If not, no worries, because the usefulness of this test is limited anyway; the real test is when it's deployed and you and your community start putting people in these groups [20:50:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [20:50:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [20:51:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: Maint T352010 [20:51:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:51:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 12 hosts with reason: Maint T352010 [20:51:54] OK it looks good to me, proceeding with the deployment [20:51:59] !log catrope@deploy1002 aram and catrope: Continuing with sync [20:53:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [20:53:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [20:54:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 17 hosts with reason: Maint T343718 [20:54:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:54:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 17 hosts with reason: Maint T343718 [20:55:44] RoanKattouw Thanks for merging. I have just installed that debug extension, but I am very new and don't know how to use it.
[20:56:37] Go to https://ckb.wiktionary.org/wiki/%D8%AA%D8%A7%DB%8C%D8%A8%DB%95%D8%AA:ListGroupRights , and check that you don't see "suppressredirect" listed [20:57:06] Then click on the wikimedia icon in the top right/left corner (the extension should have added this) and change the switch from "off" to "on" [20:57:13] Then refresh the page, and "suppressredirect" should appear [20:57:30] When you have the switch "on", it lets you see code that is in pre-deploy testing [20:57:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [20:57:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [20:58:11] This is what it looks like in the top right corner of my browser (may be top left corner if you use an RTL language) https://usercontent.irccloud-cdn.com/file/vCGESLiF/image.png [20:58:53] Or when it's off it looks like this: https://usercontent.irccloud-cdn.com/file/rn93LQGD/image.png [20:59:25] (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service on elastic2109:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:18] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1010938|Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary (T360228)]] (duration: 17m 37s) [21:03:33] T360228: Add autopatrolled, rollbacker and suppressredirect user groups for ckbwiktionary - https://phabricator.wikimedia.org/T360228 [21:07:31] RoanKattouw, Thanks, I checked and saw the three groups were added. Anything else I need to do? Can I close the phab task now? 
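The browser steps described above (load Special:ListGroupRights with the WikimediaDebug switch off, then on, and compare) could also be approximated from a script. The following is only a rough sketch and not something anyone in the channel ran. It assumes the extension works by adding an `X-Wikimedia-Debug` request header that names a debug backend; the backend hostname is a hypothetical placeholder, and the helper names `debug_headers`, `group_listed`, `fetch`, and `compare` are made up for illustration.

```python
import urllib.request

# Page from the discussion above: Special:ListGroupRights on ckb.wiktionary.org.
LIST_GROUP_RIGHTS_URL = (
    "https://ckb.wiktionary.org/wiki/"
    "%D8%AA%D8%A7%DB%8C%D8%A8%DB%95%D8%AA:ListGroupRights"
)

# Hypothetical placeholder, NOT a real host; substitute an actual debug backend.
EXAMPLE_BACKEND = "mwdebug-example.invalid"


def debug_headers(backend=None):
    """Build request headers; add the debug header only when a backend is given.

    Assumption: the header is named X-Wikimedia-Debug and takes a
    "backend=<host>" value. Check the Mwdebug documentation before relying on it.
    """
    headers = {"User-Agent": "listgrouprights-check-sketch/0.1"}
    if backend is not None:
        headers["X-Wikimedia-Debug"] = f"backend={backend}"
    return headers


def group_listed(page_text, group):
    """Return True if the group's internal name appears in the page text."""
    return group in page_text


def fetch(url, backend=None):
    """Fetch a page, optionally routed to a debug backend (needs network access)."""
    request = urllib.request.Request(url, headers=debug_headers(backend))
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def compare(url, group, backend):
    """Return (listed without debug header, listed with debug header)."""
    return (
        group_listed(fetch(url), group),
        group_listed(fetch(url, backend), group),
    )
```

Calling `compare(LIST_GROUP_RIGHTS_URL, "suppressredirect", backend)` with a real backend would mirror the off/on check from the discussion. Once a change is fully deployed, both requests should show the group, so the comparison is only informative during the pre-deploy testing window.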
[21:07:39] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [21:07:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9663598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... [21:07:55] Aram46: Yes you can close the task [21:08:40] Really thank you! I learned more things. [21:08:42] Aram46: Also if you / the local admins want to provide a translated description of the "suppressredirect" group name, you can do that by editing https://ckb.wiktionary.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%D9%81%DB%95%D8%B1%DA%BE%DB%95%D9%86%DA%AF:suppressredirect&action=edit&redlink=1 [21:08:50] Otherwise it'll keep being displayed in English [21:09:20] And you / local admins should now be able to add users to these new groups [21:10:14] Yes, you are right. We will do it later.
[21:12:25] (03PS1) 10Cwhite: beta-logs: replace logging-logstash-01 with -03 [puppet] - 10https://gerrit.wikimedia.org/r/1014062 (https://phabricator.wikimedia.org/T353912) [21:13:26] (03CR) 10Cwhite: [C:03+2] beta-logs: replace logging-logstash-01 with -03 [puppet] - 10https://gerrit.wikimedia.org/r/1014062 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [21:30:42] (03PS1) 10Cwhite: logstash: enable openjdk-17 support [puppet] - 10https://gerrit.wikimedia.org/r/1014063 (https://phabricator.wikimedia.org/T353912) [21:35:39] (03CR) 10Cwhite: [C:03+2] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/1014063/1739/" [puppet] - 10https://gerrit.wikimedia.org/r/1014063 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [21:38:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:38:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2005.codfw.wmnet with OS bullseye [21:39:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9663755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye completed:... 
[21:42:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2100-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:45:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882 [21:45:36] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [21:47:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2100-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:50:23] (03PS1) 10Daimona Eaytoy: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) [21:51:37] (03CR) 10CI reject: [V:04-1] Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [21:52:57] (03PS1) 10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) [21:53:40] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [21:54:48] (03PS1) 
10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) [21:55:32] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [21:55:57] (03PS1) 10Daimona Eaytoy: Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) [21:56:42] (03CR) 10CI reject: [V:04-1] Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [21:57:01] (03PS1) 10Daimona Eaytoy: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) [21:57:56] (03CR) 10CI reject: [V:04-1] Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [21:58:32] (03PS2) 10Daimona Eaytoy: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) [21:59:17] (03CR) 10CI reject: [V:04-1] Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [22:00:32] (03PS3) 10Daimona Eaytoy: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) [22:00:44] (03PS1) 10Cwhite: logstash: introduce java_package option [puppet] - 10https://gerrit.wikimedia.org/r/1014064 
(https://phabricator.wikimedia.org/T353912) [22:02:33] (03PS2) 10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) [22:02:42] (03PS2) 10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) [22:03:15] (03PS3) 10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) [22:03:17] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [22:03:34] (03PS3) 10Daimona Eaytoy: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) [22:07:28] (03PS2) 10Cwhite: logstash: introduce java_package option [puppet] - 10https://gerrit.wikimedia.org/r/1014064 (https://phabricator.wikimedia.org/T353912) [22:08:50] (03PS2) 10Daimona Eaytoy: Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) [22:08:54] (03PS3) 10Daimona Eaytoy: Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) [22:08:59] (03PS2) 10Daimona Eaytoy: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) [22:09:33] (03PS3) 10Cwhite: logstash: introduce java_package option [puppet] - 10https://gerrit.wikimedia.org/r/1014064 (https://phabricator.wikimedia.org/T353912) [22:13:39] (03PS1) 10Btullis: Migrate datahub to use 
external-services for CAS IDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014065 (https://phabricator.wikimedia.org/T331894) [22:15:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [22:16:24] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9663941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [22:18:15] (03CR) 10Btullis: [C:03+2] Migrate datahub to use external-services for CAS IDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014065 (https://phabricator.wikimedia.org/T331894) (owner: 10Btullis) [22:19:15] (03Merged) 10jenkins-bot: Migrate datahub to use external-services for CAS IDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014065 (https://phabricator.wikimedia.org/T331894) (owner: 10Btullis) [22:19:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for mpham - https://phabricator.wikimedia.org/T360641#9663944 (10MPhamWMF) @jcrespo , Looks like I have access. 
Thanks, and sorry for the confusion with the duplicate request (I didn't realize either) [22:20:19] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [22:23:13] (03PS2) 10Volans: DNS-related cookbooks: adapt for conftool state [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) [22:23:14] (03PS1) 10Volans: sre.puppet.migrate-host: add --no-downtime flag [cookbooks] - 10https://gerrit.wikimedia.org/r/1014633 [22:24:23] (03CR) 10Cwhite: [C:03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/output/1014064/1742/" [puppet] - 10https://gerrit.wikimedia.org/r/1014064 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [22:24:27] (03PS1) 10Reedy: PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014460 [22:24:27] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [22:24:35] (03PS1) 10Reedy: PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014461 [22:24:41] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [22:25:06] (03PS1) 10Urbanecm: Add CommunityConfiguration log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) [22:25:19] (03CR) 10Reedy: [C:03+2] PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014461 (owner: 10Reedy) [22:25:26] (03CR) 10Reedy: [C:03+2] PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014460 (owner: 10Reedy) [22:25:30] (03CR) 10Volans: 
[C:03+1] "Feel free to merge when you see fit. It would be good to have serviceops blessing too." [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [22:25:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2042-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:25:49] (03PS2) 10Urbanecm: Add CommunityConfiguration log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) [22:27:02] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1014633 (owner: 10Volans) [22:27:03] (03PS1) 10Btullis: Fix typo in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) [22:27:57] (03Merged) 10jenkins-bot: PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014461 (owner: 10Reedy) [22:28:01] (03Merged) 10jenkins-bot: PopulateEditCount: Look for existing vote rows to find a starting point in case of resume [extensions/SecurePoll] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1014460 (owner: 10Reedy) [22:29:22] (03CR) 10Volans: [C:03+2] sre.puppet.migrate-host: add --no-downtime flag [cookbooks] - 10https://gerrit.wikimedia.org/r/1014633 (owner: 10Volans) [22:30:39] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2042-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:30:59] !log reedy@deploy1002 Started scap: SecurePoll PopulateEditCount fix [22:32:47] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [22:32:58] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [22:34:23] (03Merged) 10jenkins-bot: sre.puppet.migrate-host: add --no-downtime flag [cookbooks] - 10https://gerrit.wikimedia.org/r/1014633 (owner: 10Volans) [22:38:37] (03PS2) 10Btullis: Fix typo in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) [22:39:37] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [22:42:51] (03PS3) 10Btullis: Fix whitespace in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) [22:45:16] (03PS4) 10Btullis: Fix whitespace in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) [22:53:14] (03CR) 10Btullis: [C:03+2] Fix whitespace in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [22:54:23] (03Merged) 10jenkins-bot: Fix whitespace in the external-services values for datahub staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014646 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [22:55:45] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [22:56:48] !log reedy@deploy1002 Finished scap: SecurePoll PopulateEditCount fix (duration: 25m 49s) [22:58:58] !log btullis@deploy1002 helmfile 
[staging] DONE helmfile.d/services/datahub: sync on main [23:00:07] (03CR) 10Cwhite: [C:03+1] alert: Update hiera entries for alert2001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [23:01:27] (03CR) 10Cwhite: [C:03+1] alert: Update hiera entries for alert1001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [23:02:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58931 and previous config saved to /var/cache/conftool/dbconfig/20240326-230220-ladsgroup.json [23:02:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:17:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P58932 and previous config saved to /var/cache/conftool/dbconfig/20240326-231728-ladsgroup.json [23:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:30:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2005.codfw.wmnet with OS bullseye [23:32:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P58934 and previous config saved to 
/var/cache/conftool/dbconfig/20240326-233235-ladsgroup.json [23:45:01] (03CR) 10RLazarus: "At the time we added this test, the Barack Obama page did consistently load within the default timeout, and we wanted a test to make sure " [puppet] - 10https://gerrit.wikimedia.org/r/1014425 (https://phabricator.wikimedia.org/T360867) (owner: 10Hashar) [23:47:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58935 and previous config saved to /var/cache/conftool/dbconfig/20240326-234743-ladsgroup.json [23:47:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:47:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:47:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:48:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T352010)', diff saved to https://phabricator.wikimedia.org/P58936 and previous config saved to /var/cache/conftool/dbconfig/20240326-234806-ladsgroup.json