[00:34:21] (PS1) Ssingh: wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) [00:35:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Disable legacy SSL port - T339299 - eevans@cumin1001 [00:36:02] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104 [00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104 (owner: TrainBranchBot) [00:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:46:59] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104 (owner: TrainBranchBot) [01:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:22:34] (PS1) Andrea Denisse: icinga: Add notification when purging nagios resources [puppet] - https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) [01:41:05] 
(SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:42:37] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/951592/42980/" [puppet] - https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: Andrea Denisse) [02:11:43] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:43] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:40:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:45:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [03:45:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [03:45:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50966 and previous config saved to /var/cache/conftool/dbconfig/20230823-034519-ladsgroup.json [03:45:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [03:45:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [03:45:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:45:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50967 and previous config saved to /var/cache/conftool/dbconfig/20230823-034549-ladsgroup.json [03:46:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:46:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [03:46:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:46:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [03:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50968 and previous config saved to /var/cache/conftool/dbconfig/20230823-034643-ladsgroup.json [03:46:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [03:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P50969 and previous config saved to /var/cache/conftool/dbconfig/20230823-034656-ladsgroup.json [03:50:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50970 and previous config saved to /var/cache/conftool/dbconfig/20230823-035042-ladsgroup.json [03:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50971 and previous config saved to /var/cache/conftool/dbconfig/20230823-035157-ladsgroup.json [03:54:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:54:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:56:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
es1028.eqiad.wmnet with reason: Maintenance [03:57:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [03:57:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50972 and previous config saved to /var/cache/conftool/dbconfig/20230823-035707-ladsgroup.json [04:01:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50973 and previous config saved to /var/cache/conftool/dbconfig/20230823-040158-ladsgroup.json [04:02:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:02:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P50974 and previous config saved to /var/cache/conftool/dbconfig/20230823-040207-ladsgroup.json [04:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50975 and previous config saved to /var/cache/conftool/dbconfig/20230823-040548-ladsgroup.json [04:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50976 and previous config saved to /var/cache/conftool/dbconfig/20230823-040704-ladsgroup.json [04:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P50977 and previous config saved to 
/var/cache/conftool/dbconfig/20230823-041704-ladsgroup.json [04:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P50978 and previous config saved to /var/cache/conftool/dbconfig/20230823-041712-ladsgroup.json [04:19:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:19:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50979 and previous config saved to /var/cache/conftool/dbconfig/20230823-042054-ladsgroup.json [04:22:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50980 and previous config saved to /var/cache/conftool/dbconfig/20230823-042210-ladsgroup.json [04:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:27:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50981 and previous config saved to /var/cache/conftool/dbconfig/20230823-042732-ladsgroup.json [04:27:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:32:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P50982 and previous config saved to /var/cache/conftool/dbconfig/20230823-043210-ladsgroup.json [04:32:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P50983 and previous config saved to /var/cache/conftool/dbconfig/20230823-043216-ladsgroup.json [04:33:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [04:34:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:36:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50984 and previous config saved to /var/cache/conftool/dbconfig/20230823-043600-ladsgroup.json [04:36:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [04:36:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [04:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P50985 and previous config saved to 
/var/cache/conftool/dbconfig/20230823-043625-ladsgroup.json [04:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50986 and previous config saved to /var/cache/conftool/dbconfig/20230823-043716-ladsgroup.json [04:37:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [04:37:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [04:37:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P50987 and previous config saved to /var/cache/conftool/dbconfig/20230823-043741-ladsgroup.json [04:39:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50988 and previous config saved to /var/cache/conftool/dbconfig/20230823-044238-ladsgroup.json [04:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P50989 and previous config saved to /var/cache/conftool/dbconfig/20230823-044251-ladsgroup.json [04:43:35] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587 [04:43:53] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) 
Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587 [04:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P50990 and previous config saved to /var/cache/conftool/dbconfig/20230823-044356-ladsgroup.json [04:44:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587 [04:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50991 and previous config saved to /var/cache/conftool/dbconfig/20230823-044717-ladsgroup.json [04:47:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance [04:47:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance [04:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P50992 and previous config saved to /var/cache/conftool/dbconfig/20230823-044741-ladsgroup.json [04:48:03] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 99 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 191, active_shards: 283, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 97, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 74.08376963350786 
https://wikitech.wikimedia.org/wiki/Search%23Administration [04:49:29] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 191, active_shards: 382, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [04:50:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:50:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50993 and previous config saved to /var/cache/conftool/dbconfig/20230823-045038-ladsgroup.json [04:50:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:53:37] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587 [04:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50994 and previous config saved to /var/cache/conftool/dbconfig/20230823-045606-ladsgroup.json [04:56:09] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [04:56:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [04:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P50995 and previous config saved to /var/cache/conftool/dbconfig/20230823-045625-ladsgroup.json [04:57:43] (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:57:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50996 and previous config saved to /var/cache/conftool/dbconfig/20230823-045744-ladsgroup.json [04:57:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50997 and previous config saved to /var/cache/conftool/dbconfig/20230823-045757-ladsgroup.json [04:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50998 and previous config saved to /var/cache/conftool/dbconfig/20230823-045902-ladsgroup.json [05:03:35] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50999 and previous config saved to /var/cache/conftool/dbconfig/20230823-051112-ladsgroup.json [05:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P51000 and previous config saved to /var/cache/conftool/dbconfig/20230823-051131-ladsgroup.json [05:12:51] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P51001 and previous config saved to /var/cache/conftool/dbconfig/20230823-051251-ladsgroup.json [05:12:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [05:12:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P51002 and previous config saved to /var/cache/conftool/dbconfig/20230823-051303-ladsgroup.json [05:13:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [05:13:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51003 and previous config saved to /var/cache/conftool/dbconfig/20230823-051312-ladsgroup.json [05:14:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P51004 and previous config saved to /var/cache/conftool/dbconfig/20230823-051409-ladsgroup.json [05:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P51005 and previous config saved to /var/cache/conftool/dbconfig/20230823-052618-ladsgroup.json [05:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P51006 and previous config saved to /var/cache/conftool/dbconfig/20230823-052637-ladsgroup.json [05:28:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P51007 and previous 
config saved to /var/cache/conftool/dbconfig/20230823-052809-ladsgroup.json [05:28:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [05:28:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [05:28:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51008 and previous config saved to /var/cache/conftool/dbconfig/20230823-052834-ladsgroup.json [05:29:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P51009 and previous config saved to /var/cache/conftool/dbconfig/20230823-052915-ladsgroup.json [05:29:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [05:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [05:29:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51010 and previous config saved to /var/cache/conftool/dbconfig/20230823-052939-ladsgroup.json [05:29:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1463 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1452 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:45] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2451 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki 
EtcdConfig up-to-date on mw1396 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1394 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2436 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2447 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2439 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2302 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1378 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2437 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2432 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:29:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1413 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:29:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2418 is CRITICAL: etcd last index (3964209) is 
outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:09] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1463 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:09] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1452 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:11] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2451 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:11] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1396 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:11] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1394 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2436 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2447 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2439 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2302 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1378 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2432 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:13] RECOVERY - MediaWiki 
EtcdConfig up-to-date on mw2437 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:31:14] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1413 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd [05:31:14] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2418 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd [05:34:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51011 and previous config saved to /var/cache/conftool/dbconfig/20230823-053454-ladsgroup.json [05:35:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51012 and previous config saved to /var/cache/conftool/dbconfig/20230823-053553-ladsgroup.json [05:41:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P51013 and previous config saved to /var/cache/conftool/dbconfig/20230823-054124-ladsgroup.json [05:41:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [05:41:29] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:41:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [05:41:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P51014 and previous config saved to /var/cache/conftool/dbconfig/20230823-054144-ladsgroup.json [05:50:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to 
https://phabricator.wikimedia.org/P51015 and previous config saved to /var/cache/conftool/dbconfig/20230823-055000-ladsgroup.json [05:50:41] (CR) Zabe: [C: +2] wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh) [05:50:52] !log zabe@deploy1002 Backport cancelled. [05:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51016 and previous config saved to /var/cache/conftool/dbconfig/20230823-055059-ladsgroup.json [05:51:03] zabe: please don't deploy that reverse-proxy change [05:51:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:51:09] (CR) Majavah: [C: -2] wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh) [05:51:38] (PS1) Marostegui: dbproxy1012: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951709 [05:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P51017 and previous config saved to /var/cache/conftool/dbconfig/20230823-055215-ladsgroup.json [05:52:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [05:53:16] (CR) Marostegui: [C: +2] dbproxy1012: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951709 (owner: Marostegui) [05:53:43] (CR) Majavah: [C: -1] "So there's one specific edge case here, which is cloudweb* (wikitech app servers). 
Removing the edges is ok, but for eqiad and codfw we ne" [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh) [05:54:04] taavi: is there a reason for that? [05:55:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51018 and previous config saved to /var/cache/conftool/dbconfig/20230823-055500-ladsgroup.json [05:55:05] (PS1) Marostegui: dbproxy1013: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951710 [05:55:53] (CR) Marostegui: [C: +2] dbproxy1013: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951710 (owner: Marostegui) [05:56:12] zabe: I left a comment on the patch, but basically as is it would currently break wikitech [05:56:48] ok [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0600) [06:02:12] (PS1) Marostegui: dbproxy1014: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951711 [06:02:58] (CR) Marostegui: [C: +2] dbproxy1014: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951711 (owner: Marostegui) [06:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P51019 and previous config saved to /var/cache/conftool/dbconfig/20230823-060506-ladsgroup.json [06:05:43] (CR) Gmodena: [C: +2] Expose mediawiki.page_change.v1 publicly. 
[deployment-charts] - https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) (owner: Gmodena) [06:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51020 and previous config saved to /var/cache/conftool/dbconfig/20230823-060606-ladsgroup.json [06:06:34] (Merged) jenkins-bot: Expose mediawiki.page_change.v1 publicly. [deployment-charts] - https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) (owner: Gmodena) [06:07:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P51021 and previous config saved to /var/cache/conftool/dbconfig/20230823-060721-ladsgroup.json [06:10:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P51022 and previous config saved to /var/cache/conftool/dbconfig/20230823-061007-ladsgroup.json [06:18:51] Short Gerrit maintenance starts in 15 minutes. It will take around 10 minutes. 
[06:20:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51023 and previous config saved to /var/cache/conftool/dbconfig/20230823-062013-ladsgroup.json [06:20:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [06:20:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [06:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51024 and previous config saved to /var/cache/conftool/dbconfig/20230823-062038-ladsgroup.json [06:21:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51025 and previous config saved to /var/cache/conftool/dbconfig/20230823-062112-ladsgroup.json [06:21:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:21:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:21:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T344589)', diff saved to https://phabricator.wikimedia.org/P51026 and previous config saved to /var/cache/conftool/dbconfig/20230823-062136-ladsgroup.json [06:22:13] (03CR) 10Jelto: [C: 03+2] gerrit: raise maxConnectionsPerUser to 8 [puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [06:22:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P51027 and previous config saved to /var/cache/conftool/dbconfig/20230823-062227-ladsgroup.json [06:25:13] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P51028 and previous config saved to /var/cache/conftool/dbconfig/20230823-062513-ladsgroup.json [06:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51029 and previous config saved to /var/cache/conftool/dbconfig/20230823-062701-ladsgroup.json [06:30:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0600) [06:30:04] Deploy window [https://wikitech.wikimedia.org/wiki/Gerrit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0630) [06:31:44] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:03] Gerrit will restart now, should take less than 10 minutes [06:33:08] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org [06:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P51030 and previous config saved to /var/cache/conftool/dbconfig/20230823-063733-ladsgroup.json [06:37:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [06:37:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:37:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [06:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P51031 and previous config saved to /var/cache/conftool/dbconfig/20230823-063754-ladsgroup.json [06:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51032 and previous config saved to /var/cache/conftool/dbconfig/20230823-064019-ladsgroup.json [06:40:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [06:40:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [06:40:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org [06:41:00] Gerrit maintenance done [06:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P51033 and previous config saved to /var/cache/conftool/dbconfig/20230823-064207-ladsgroup.json [06:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P51034 and previous config saved to /var/cache/conftool/dbconfig/20230823-064421-ladsgroup.json [06:44:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:53:31] (03PS1) 10Marostegui: dbproxy1015: Host decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/951821 [06:53:56] (03CR) 10Marostegui: [C: 03+2] dbproxy1015: Host decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/951821 (owner: 10Marostegui) [06:54:35] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [06:54:44] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS 
Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10MoritzMuehlenhoff) [06:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P51035 and previous config saved to /var/cache/conftool/dbconfig/20230823-065714-ladsgroup.json [06:59:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P51036 and previous config saved to /var/cache/conftool/dbconfig/20230823-065927-ladsgroup.json [07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:09:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/951534 (owner: 10Jbond) [07:12:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51037 and previous config saved to /var/cache/conftool/dbconfig/20230823-071220-ladsgroup.json [07:12:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [07:12:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [07:12:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:12:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [07:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51038 and previous config saved to 
/var/cache/conftool/dbconfig/20230823-071249-ladsgroup.json [07:13:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org [07:14:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P51039 and previous config saved to /var/cache/conftool/dbconfig/20230823-071433-ladsgroup.json [07:17:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2002.wikimedia.org [07:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51040 and previous config saved to /var/cache/conftool/dbconfig/20230823-071916-ladsgroup.json [07:19:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [07:19:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [07:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51041 and previous config saved to /var/cache/conftool/dbconfig/20230823-071953-ladsgroup.json [07:19:57] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:25:24] (03CR) 10Muehlenhoff: Adapt monitoring/metrics rules for nft and ferm providers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:29:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P51042 and previous config saved to /var/cache/conftool/dbconfig/20230823-072940-ladsgroup.json [07:29:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with 
reason: Maintenance [07:29:45] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [07:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51043 and previous config saved to /var/cache/conftool/dbconfig/20230823-073001-ladsgroup.json [07:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51044 and previous config saved to /var/cache/conftool/dbconfig/20230823-073422-ladsgroup.json [07:34:57] (03CR) 10JMeybohm: "I would argue that this is a prometheus alert, not a k8s one (similar to JobUnavailable but more explicit and with higher severity). Shoul" [alerts] - 10https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [07:35:21] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:21] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:21] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:21] 
PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:21] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51045 and previous config saved to /var/cache/conftool/dbconfig/20230823-073529-ladsgroup.json [07:35:33] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:35:35] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:35] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:35] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:35:41] PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:35:53] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:02] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [07:36:15] PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:36:17] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:17] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:17] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:21] PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:36:43] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:43] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:43] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:45] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:36:59] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:37:15] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:37:35] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:37:35] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:37:39] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:38:09] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:38:19] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:38:21] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:39:55] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [07:40:19] (03PS1) 10Muehlenhoff: Make firewall logging conditional on ferm and rename the profile [puppet] - 10https://gerrit.wikimedia.org/r/951828 
(https://phabricator.wikimedia.org/T336497) [07:40:42] (03CR) 10CI reject: [V: 04-1] Make firewall logging conditional on ferm and rename the profile [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:42:46] (03PS2) 10Muehlenhoff: Make firewall logging conditional on ferm and rename the profile [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) [07:48:00] etherpad needs a short maintenance in 15 minutes [07:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51046 and previous config saved to /var/cache/conftool/dbconfig/20230823-074928-ladsgroup.json [07:49:47] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P51047 and previous config saved to /var/cache/conftool/dbconfig/20230823-075035-ladsgroup.json [07:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:53:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:53:37] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [07:56:19] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_ulsfo 
and A:cp [07:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:00:16] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host etherpad1003.eqiad.wmnet [08:00:17] !log running puppet agent on lvs5006 [08:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51048 and previous config saved to /var/cache/conftool/dbconfig/20230823-080127-ladsgroup.json [08:01:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:02:47] (03PS1) 10Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) [08:03:11] PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:03:52] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [08:04:15] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1003.eqiad.wmnet [08:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51049 and previous config saved to /var/cache/conftool/dbconfig/20230823-080435-ladsgroup.json [08:04:37] RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:04:41] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [08:04:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [08:05:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51050 and previous config saved to /var/cache/conftool/dbconfig/20230823-080500-ladsgroup.json [08:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P51051 and previous config saved to /var/cache/conftool/dbconfig/20230823-080541-ladsgroup.json [08:07:43] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [08:08:50] (03PS1) 10Fabfur: haproxy: sanitize eventual duplicate content-length header [puppet] - 10https://gerrit.wikimedia.org/r/951832 (https://phabricator.wikimedia.org/T344047) [08:09:00] (03PS1) 10Dreamy Jazz: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787) [08:09:12] (03PS1) 10Dreamy Jazz: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787) [08:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51052 and previous config saved to /var/cache/conftool/dbconfig/20230823-081122-ladsgroup.json [08:12:05] (03CR) 10Btullis: [C: 03+1] switch an-worker[17-48] to reuse-analytics-hadoop recipe [puppet] - 10https://gerrit.wikimedia.org/r/951458 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) 
[08:14:18] (03CR) 10Btullis: [C: 03+1] "Definitely worth a try." [deployment-charts] - 10https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:15:42] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [08:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P51053 and previous config saved to /var/cache/conftool/dbconfig/20230823-081633-ladsgroup.json [08:20:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51054 and previous config saved to /var/cache/conftool/dbconfig/20230823-082047-ladsgroup.json [08:20:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [08:20:52] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:21:00] (03CR) 10Jbond: [C: 03+2] ferm::service: make port optional so we can use port_range [puppet] - 10https://gerrit.wikimedia.org/r/951534 (owner: 10Jbond) [08:21:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [08:21:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:21:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:21:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51055 and previous config saved to /var/cache/conftool/dbconfig/20230823-082116-ladsgroup.json [08:21:20] (03PS1) 10Dreamy Jazz: clienthints: Lower API max lag 
time to 5 minutes on group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) [08:21:28] !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka jumbo-eqiad cluster: Reboot kafka nodes [08:24:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1011.eqiad.wmnet [08:26:20] (03CR) 10Stevemunene: [C: 03+2] datahub: set the oidc client authentication method [deployment-charts] - 10https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:26:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51056 and previous config saved to /var/cache/conftool/dbconfig/20230823-082628-ladsgroup.json [08:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51057 and previous config saved to /var/cache/conftool/dbconfig/20230823-082657-ladsgroup.json [08:27:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:27:11] (03Merged) 10jenkins-bot: datahub: set the oidc client authentication method [deployment-charts] - 10https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:27:36] (03PS1) 10JMeybohm: deployment_server::helmfile: Iterate over clusters, not services [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) [08:28:09] PROBLEM - Restbase root url on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/RESTBase [08:28:27] (03CR) 10Jbond: [C: 03+1] "i would have probably gone with profile::ferm::log but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:28:44] (03CR) 10Jbond: [C: 
03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:28:53] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:29:26] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:29:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1011.eqiad.wmnet [08:29:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42981/console" [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [08:31:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P51058 and previous config saved to /var/cache/conftool/dbconfig/20230823-083140-ladsgroup.json [08:32:12] (03PS2) 10JMeybohm: deployment_server::helmfile: Iterate over clusters groups first [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) [08:34:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42982/console" [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [08:34:35] (03CR) 10Muehlenhoff: Make firewall logging conditional on ferm and rename the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:34:45] PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:35:49] !log fetch HAProxy 2.6.15 on thirdparty/haproxy26 for bullseye (apt.wm.o) - T344047 [08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:57] 10ops-codfw, 
10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jgiannelos) On the OSM sync side of things, it might be worth checking if the system catches up with the diffs (~1 week worth of diffs could be manageable). The idea is that we avoid havin... [08:36:09] RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:39:07] (03PS1) 10JMeybohm: deployment_server/kubernetes: Readd admin_services secrets [labs/private] - 10https://gerrit.wikimedia.org/r/951836 (https://phabricator.wikimedia.org/T297417) [08:41:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51059 and previous config saved to /var/cache/conftool/dbconfig/20230823-084134-ladsgroup.json [08:42:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P51060 and previous config saved to /var/cache/conftool/dbconfig/20230823-084203-ladsgroup.json [08:43:15] PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:40] (03PS1) 10Stevemunene: datahub:main chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) [08:46:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51061 and previous config saved to /var/cache/conftool/dbconfig/20230823-084646-ladsgroup.json [08:46:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:46:52] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:47:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:47:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:47:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:47:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51062 and previous config saved to /var/cache/conftool/dbconfig/20230823-084711-ladsgroup.json [08:47:37] !log run puppet agent on lvs5004 to clear alert [08:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:58] 10sre-alert-triage, 10Data-Platform-SRE: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (10BTullis) 05Open→03Resolved a:03BTullis Thanks @gmodena [08:52:25] RECOVERY - Check systemd state on kubernetes2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:35] RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:33] RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:55] (03PS6) 10Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461 [08:56:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51063 and previous config saved to /var/cache/conftool/dbconfig/20230823-085640-ladsgroup.json [08:56:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet 
with reason: Maintenance [08:57:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [08:57:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51064 and previous config saved to /var/cache/conftool/dbconfig/20230823-085706-ladsgroup.json [08:57:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P51065 and previous config saved to /var/cache/conftool/dbconfig/20230823-085715-ladsgroup.json [08:57:29] 10SRE, 10Data-Platform-SRE, 10User-MoritzMuehlenhoff: Configure the Hadoop MapReduce ports to use a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) [08:58:22] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2042.codfw.wmnet} and A:cp [08:59:13] (03PS7) 10Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461 [08:59:19] !log update to HAProxy 2.6.15 in cp2042 (upload) - T344047 [08:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2042.codfw.wmnet} and A:cp [09:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51066 and previous config saved to /var/cache/conftool/dbconfig/20230823-090147-ladsgroup.json [09:01:51] PROBLEM - SSH on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:01:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:02:27] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42984/console" [puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert) [09:03:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51067 and previous config saved to /var/cache/conftool/dbconfig/20230823-090332-ladsgroup.json [09:05:09] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2041.codfw.wmnet} and A:cp [09:05:23] !log update to HAProxy 2.6.15 in cp2041 (text) - T344047 [09:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:03] (03PS1) 10JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) [09:06:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2041.codfw.wmnet} and A:cp [09:07:51] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:09:46] (03CR) 10Btullis: [C: 03+1] datahub:main chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:10:03] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:11:17] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [09:12:16] (03PS2) 10JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) [09:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51068 and previous config saved to /var/cache/conftool/dbconfig/20230823-091221-ladsgroup.json [09:12:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [09:12:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:12:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [09:12:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51069 and previous config saved to /var/cache/conftool/dbconfig/20230823-091242-ladsgroup.json [09:13:05] (03CR) 10Clément Goubert: [V: 03+1] "Bumping this because it happened again after the last round of kubelet restarts." 
[puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert) [09:16:39] (03CR) 10Stevemunene: [C: 03+2] datahub:main chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:16:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P51070 and previous config saved to /var/cache/conftool/dbconfig/20230823-091653-ladsgroup.json [09:17:24] (03Merged) 10jenkins-bot: datahub:main chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:18:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51071 and previous config saved to /var/cache/conftool/dbconfig/20230823-091821-ladsgroup.json [09:18:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:18:37] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:18:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51072 and previous config saved to /var/cache/conftool/dbconfig/20230823-091838-ladsgroup.json [09:19:14] (03CR) 10Btullis: C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:20:06] (03PS3) 10JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) [09:20:35] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/945785 (owner: 10Muehlenhoff) [09:21:02] (03CR) 10JMeybohm: [C: 03+1] k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert) [09:21:12] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:22:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42987/console" [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [09:22:18] (03CR) 10Jbond: [C: 04-1] "see inline still some issues with the exec" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [09:24:40] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:24:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1009.eqiad.wmnet [09:26:52] (03PS21) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:26:57] (03CR) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:27:06] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:29:07] (03CR) 10Btullis: C:bigtop::hadoop move net-topology.py to files. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:31:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1009.eqiad.wmnet [09:31:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1010.eqiad.wmnet [09:32:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P51073 and previous config saved to /var/cache/conftool/dbconfig/20230823-093200-ladsgroup.json [09:33:04] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert) [09:33:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P51074 and previous config saved to /var/cache/conftool/dbconfig/20230823-093327-ladsgroup.json [09:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51075 and previous config saved to /var/cache/conftool/dbconfig/20230823-093345-ladsgroup.json [09:35:51] (03CR) 10Jbond: [C: 03+1] Make firewall logging conditional on ferm and rename the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:37:25] (03CR) 10Jbond: C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:37:41] (03CR) 10Jcrespo: "Let me run puppet compiler on bacula dir and on the hosts to make sure it is a noop (or mostly a noop)." 
[puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [09:38:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1010.eqiad.wmnet [09:40:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [09:41:02] (03PS1) 10Muehlenhoff: Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) [09:41:22] (03CR) 10CI reject: [V: 04-1] Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:42:24] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10MSantos) [09:44:39] (03PS1) 10Jbond: admin: add lwatson to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) [09:45:15] (03CR) 10Jbond: [C: 03+2] admin: add lwatson to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) (owner: 10Jbond) [09:45:25] (03CR) 10Clément Goubert: [C: 03+1] deployment_server::helmfile: Iterate over clusters groups first [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [09:46:12] (03CR) 10Clément Goubert: [C: 03+1] 
deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [09:46:20] (03PS2) 10Jbond: admin: add lwatson to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) [09:46:46] (03CR) 10Jbond: [C: 03+2] admin: add lwatson to ldap config [puppet] - 10https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) (owner: 10Jbond) [09:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51078 and previous config saved to /var/cache/conftool/dbconfig/20230823-094706-ladsgroup.json [09:47:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:47:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:47:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:47:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:47:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51079 and previous config saved to /var/cache/conftool/dbconfig/20230823-094727-ladsgroup.json [09:48:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P51081 and previous config 
saved to /var/cache/conftool/dbconfig/20230823-094834-ladsgroup.json [09:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51082 and previous config saved to /var/cache/conftool/dbconfig/20230823-094851-ladsgroup.json [09:48:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [09:49:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [09:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51083 and previous config saved to /var/cache/conftool/dbconfig/20230823-094916-ladsgroup.json [09:49:54] (03PS1) 10Jbond: admin: add vriley to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/951891 (https://phabricator.wikimedia.org/T344770) [09:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51084 and previous config saved to /var/cache/conftool/dbconfig/20230823-095040-ladsgroup.json [09:52:53] (03CR) 10Jbond: [C: 03+2] admin: add vriley to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/951891 (https://phabricator.wikimedia.org/T344770) (owner: 10Jbond) [09:55:25] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (10jbond) 05Open→03Resolved @lwatson you are now part of the wmf group so should be able to access all the listed sites. 
[09:57:41] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [09:57:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51085 and previous config saved to /var/cache/conftool/dbconfig/20230823-095749-ladsgroup.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1000) [10:00:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] deployment_server/helmfile: Write admin_services_secrets to files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [10:02:39] (03PS11) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:03:07] (03PS4) 10JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) [10:03:22] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Valerie Riley - https://phabricator.wikimedia.org/T344770 (10jbond) 05Open→03Resolved a:03jbond @VRiley-WMF you were already part of the WMF group so should have read-only access to netbox. For shell access please [[ https://w... 
[10:03:27] (03CR) 10Gmodena: rdf-streaming-updater-dse-k8s: Add Zookeeper HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [10:03:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51086 and previous config saved to /var/cache/conftool/dbconfig/20230823-100340-ladsgroup.json [10:03:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:03:45] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:03:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:03:58] (03PS12) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:04:34] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10JAllemandou) >>! In T343855#9111286, @Htriedman wrote: > 1. Ensure that (a) historical data is loaded into cassandra (currently thi... 
[10:04:42] (03PS13) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [10:05:08] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42990/console" [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [10:05:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:56] (03PS2) 10Muehlenhoff: Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) [10:06:25] (03CR) 10Jelto: [C: 03+1] "lgtm, nice addition!" [puppet] - 10https://gerrit.wikimedia.org/r/951429 (https://phabricator.wikimedia.org/T344620) (owner: 10EoghanGaffney) [10:06:44] (03CR) 10CI reject: [V: 04-1] Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:07:49] (03PS1) 10Jbond: admin: offboard oleksandrtsyba-wmde and nosc [puppet] - 10https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766) [10:08:04] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Add warning banner to replica instances [puppet] - 10https://gerrit.wikimedia.org/r/951429 (https://phabricator.wikimedia.org/T344620) (owner: 10EoghanGaffney) [10:09:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:54] !log temporary depool/repool cp4040 for haproxy service restart [10:09:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:59] (03PS8) 10EoghanGaffney: gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [10:12:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51087 and previous config saved to /var/cache/conftool/dbconfig/20230823-101255-ladsgroup.json [10:14:29] !log depool cp2039 to run some HAProxy experiments [10:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:20] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Add locking to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [10:22:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (10jbond) @WMDE-leszek i have dropped the ldap permissions. Are you able to confirm the Phabricator accounts so i can also offboard them from here. 
thanks [10:23:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "You convinced me this is mostly transitory, so LGTM 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [10:24:09] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51088 and previous config saved to /var/cache/conftool/dbconfig/20230823-102801-ladsgroup.json [10:29:13] RECOVERY - Host ml-serve2001 is UP: PING WARNING - Packet loss = 77%, RTA = 31.72 ms [10:29:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:29:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51089 and previous config saved to /var/cache/conftool/dbconfig/20230823-102939-ladsgroup.json [10:29:52] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:30:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766) (owner: 10Jbond) [10:30:38] (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:31:44] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:34] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (10WMDE-leszek) thanks @jbond . Phabricator accounts have been @WMDE_Norman and @oleksandr_tsyba_WMDE - both disabled already [10:37:50] !log repool cp2039 [10:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:19] (03PS1) 10EoghanGaffney: gitlab: Fix paths for backup common functions [puppet] - 10https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332) [10:40:18] !log rolling upgrade to HAProxy 2.6.15 - T344047 [10:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:42] (03CR) 10Muehlenhoff: [C: 03+2] Make firewall logging conditional on ferm and rename the profile [puppet] - 10https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:43:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51090 and previous config saved to /var/cache/conftool/dbconfig/20230823-104308-ladsgroup.json [10:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P51091 and previous config saved to /var/cache/conftool/dbconfig/20230823-104445-ladsgroup.json [10:45:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:46:21] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2041.*} and not P{cp2039.*} and A:cp [10:46:46] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and not P{cp2042.*} and A:cp [10:47:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51092 and previous config saved to /var/cache/conftool/dbconfig/20230823-104725-ladsgroup.json [10:48:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:33] (03PS3) 10Giuseppe Lavagetto: termbox-test: call mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) [10:49:22] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) [10:49:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] termbox-test: call mw-api-int (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) (owner: 10Giuseppe Lavagetto) [10:50:44] (03Merged) 10jenkins-bot: termbox-test: call mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 
(https://phabricator.wikimedia.org/T334064) (owner: 10Giuseppe Lavagetto) [10:53:03] (03CR) 10Jcrespo: "One second, in the last moment I thought of an option that may be much easier for both of us, but I want to do some tests first!" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [10:54:18] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [10:54:29] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [10:58:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:16] (03CR) 10Sergio Gimeno: [C: 04-1] "Awaiting to inform communities, T308138#9112945" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [10:59:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P51093 and previous config saved to /var/cache/conftool/dbconfig/20230823-105954-ladsgroup.json [11:00:07] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2041.*} and not P{cp2039.*} and A:cp [11:00:47] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [11:01:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and not P{cp2042.*} and A:cp [11:02:32] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51094 and previous config saved to /var/cache/conftool/dbconfig/20230823-110231-ladsgroup.json [11:02:37] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [11:02:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:37] (03PS3) 10Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400) [11:03:39] (03PS2) 10Hnowlan: kubernetes: add users for media_analytics service, cassandra config [puppet] - 10https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380) [11:10:15] (03CR) 10Jbond: [C: 03+2] admin: offboard oleksandrtsyba-wmde and nosc [puppet] - 10https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766) (owner: 10Jbond) [11:11:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:47] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:01] (03PS3) 10Giuseppe Lavagetto: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [11:13:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) 
[11:14:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:26] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (10jbond) 05Open→03Resolved a:03jbond @WMDE-leszek Thanks looks like we are all done then. but please reopen if you see anything else that needs clea... [11:15:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51095 and previous config saved to /var/cache/conftool/dbconfig/20230823-111500-ladsgroup.json [11:15:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:15:57] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [11:17:03] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host atlas2001.wikimedia.org [11:17:04] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:17:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51096 and previous config saved to /var/cache/conftool/dbconfig/20230823-111737-ladsgroup.json [11:18:12] (03CR) 10Clément Goubert: [C: 03+1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [11:18:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:03] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
atlas2001.wikimedia.org - ayounsi@cumin1001" [11:20:31] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas2001.wikimedia.org - ayounsi@cumin1001" [11:21:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:21:02] !log ayounsi@cumin1001 START - Cookbook sre.dns.wipe-cache atlas2001.wikimedia.org on all recursors [11:21:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas2001.wikimedia.org on all recursors [11:23:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [11:24:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet [11:24:47] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas2001.wikimedia.org - ayounsi@cumin1001" [11:24:49] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas2001.wikimedia.org - ayounsi@cumin1001" [11:25:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas2001.wikimedia.org [11:25:42] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1002.eqiad.wmnet [11:27:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - 
AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:10] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332) (owner: 10EoghanGaffney) [11:28:06] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [11:28:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet [11:29:15] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:11] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: Fix paths for backup common functions [puppet] - 10https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332) (owner: 10EoghanGaffney) [11:30:15] (03PS1) 10Hnowlan: service: add media-analytics service entry [puppet] - 10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) [11:30:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [11:31:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [11:31:33] (03PS1) 10Jbond: netbox: add datacenter-ops group as a super user [puppet] - 10https://gerrit.wikimedia.org/r/951902 (https://phabricator.wikimedia.org/T341581) [11:31:35] (03PS1) 10Jbond: idp: add datacenter-ops group to other services they should have access to [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) [11:32:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1002.eqiad.wmnet [11:32:44] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51097 and previous config saved to /var/cache/conftool/dbconfig/20230823-113244-ladsgroup.json [11:32:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:32:59] 10SRE-tools, 10Ganeti, 10Spicerack: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (10ayounsi) [11:33:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:33:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51098 and previous config saved to /var/cache/conftool/dbconfig/20230823-113310-ladsgroup.json [11:34:53] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (10ayounsi) [11:35:03] (03PS10) 10Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) [11:35:10] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server::helmfile: Iterate over clusters groups first [puppet] - 10https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:35:13] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server/helmfile: Write admin_services_secrets to files [puppet] - 10https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:35:35] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [11:35:38] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] deployment_server/kubernetes: Readd admin_services secrets [labs/private] - 
10https://gerrit.wikimedia.org/r/951836 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:35:43] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [11:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [11:36:25] (03PS3) 10Muehlenhoff: Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) [11:36:43] (03PS11) 10Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) [11:37:14] (03CR) 10CI reject: [V: 04-1] Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:37:58] !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [11:38:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jcrespo) Hey, I just reached this ticket by accident. Could you refer to me the documentation where there was consensus and... 
[11:38:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:39:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51099 and previous config saved to /var/cache/conftool/dbconfig/20230823-113921-ladsgroup.json [11:39:23] (03CR) 10Jbond: "@Amir, could i get a +1 from you specifically in relation to the comments starting from https://phabricator.wikimedia.org/T344599#9106167" [puppet] - 10https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [11:39:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:40:59] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:23] (03CR) 10Jbond: [C: 03+2] httpyaml: replace URI.escape [puppet] - 10https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:41:58] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_ulsfo and A:cp [11:41:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:42:46] (03PS4) 10Muehlenhoff: Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) [11:43:27] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:44:00] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (10MoritzMuehlenhoff) [11:44:19] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Spicerack, 10User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (10MoritzMuehlenhoff) [11:46:18] (03PS1) 10JMeybohm: deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - 10https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) [11:46:55] 10SRE, 10MW-on-K8s, 10serviceops: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (10Clement_Goubert) [11:47:11] (03PS2) 10Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) [11:48:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:48:49] (03PS1) 10JMeybohm: Add cfssl-issuer admin secrets to ml-serve [labs/private] - 10https://gerrit.wikimedia.org/r/951908 (https://phabricator.wikimedia.org/T297417) [11:49:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:49:47] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet [11:51:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers 
(exit_code=0) for Kafka jumbo-eqiad cluster: Reboot kafka nodes [11:51:30] (03PS2) 10JMeybohm: deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - 10https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) [11:53:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42993/console" [puppet] - 10https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:54:16] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - 10https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51100 and previous config saved to /var/cache/conftool/dbconfig/20230823-115427-ladsgroup.json [11:54:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet [11:54:59] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add cfssl-issuer admin secrets to ml-serve [labs/private] - 10https://gerrit.wikimedia.org/r/951908 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [11:58:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:59:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:59:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [12:00:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling 
upgrade of HAProxy on A:cp-upload_drmrs and A:cp [12:01:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:02:02] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951867 [12:02:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:03:34] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrich: stream version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [12:03:56] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-codfw [12:04:39] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: stream version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [12:05:43] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51101 and previous config saved to /var/cache/conftool/dbconfig/20230823-120933-ladsgroup.json [12:11:47] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:11:51] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:12:01] !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test-eqiad cluster: Reboot kafka nodes [12:12:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP 
CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-codfw [12:17:51] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [12:19:01] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42995/console" [puppet] - 10https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: 10Jaime Nuche) [12:19:06] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad and A:cp [12:19:21] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad and A:cp [12:20:17] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:13] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: 10Jaime Nuche) [12:22:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) p:05Medium→03Low >>! In T221083#9113102, @jcrespo wrote: > Hey, I just reached this ticket by accident. you ha... 
[12:23:43] (03CR) 10Muehlenhoff: idp: add datacenter-ops group to other services they should have access to (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [12:24:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51102 and previous config saved to /var/cache/conftool/dbconfig/20230823-122440-ladsgroup.json [12:25:43] (03PS1) 10Zoranzoki21: [pawiki] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815) [12:25:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:26:33] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [12:26:39] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:26:56] (03CR) 10Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:27:52] (03PS1) 10Zoranzoki21: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) [12:27:55] (03CR) 10Btullis: [C: 03+1] "Thanks for this. I'm happy with this, once the CI issue is fixed." 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:28:11] (03PS2) 10Zoranzoki21: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) [12:29:25] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-eqiad [12:31:09] (03CR) 10Jbond: "cheers will update" [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [12:31:39] (03PS1) 10JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) [12:31:58] (03PS1) 10Muehlenhoff: Update cookbook header to reflect the fact that we also support VMs these days [cookbooks] - 10https://gerrit.wikimedia.org/r/951916 [12:32:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2004.codfw.wmnet with OS bookworm [12:32:48] (03PS2) 10JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) [12:32:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet [12:34:02] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:34:06] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:34:19] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:34:21] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:35:14] (03CR) 10Muehlenhoff: idp: add datacenter-ops group to other services they should have access to (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951903 
(https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [12:37:46] (03PS2) 10Jbond: netbox: add datacenter-ops group as a super user [puppet] - 10https://gerrit.wikimedia.org/r/951902 (https://phabricator.wikimedia.org/T341581) [12:37:48] (03PS2) 10Jbond: idp: add datacenter-ops to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) [12:37:55] (03PS1) 10Jbond: idp: drop superfluous permissions [puppet] - 10https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) [12:38:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-eqiad [12:40:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet [12:42:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [12:47:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [12:48:06] !log update jwt-authorizer package to v1.2.0 [12:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:15] (03CR) 10Jbond: "thanks see inline, perhaps its best to move this discussion back to the task?" 
[puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [12:48:39] !log update jwt-authorizer package to v1.2.0 - T337474 [12:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:43] T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474 [12:48:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/951916 (owner: 10Muehlenhoff) [12:49:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [12:49:25] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp [12:49:30] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp [12:54:50] (03CR) 10Jelto: [V: 03+1 C: 03+2] jwt_authorizer: reflect changes to accept multiple issuers [puppet] - 10https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: 10Jaime Nuche) [12:55:55] (03PS1) 10Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951850 [12:56:09] (03PS2) 10Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951850 [12:56:35] !log registry* - upgrade jwt-authorizer package on all 4 hosts to version 1.2.0-1 - T337474 [12:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:41] T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474 [12:58:05] (03CR) 10Muehlenhoff: [C: 03+2] Update cookbook header to reflect the fact that we also support VMs these days [cookbooks] - 10https://gerrit.wikimedia.org/r/951916 (owner: 10Muehlenhoff) [12:58:07] 
!log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [12:58:17] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1300). [13:00:05] Dreamy_Jazz and kizule: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] \o [13:00:13] o/ [13:00:18] \o [13:00:24] I can deploy! [13:00:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! We need the same for idp_test.yaml as well, BTW." [puppet] - 10https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [13:01:00] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [13:01:10] hm, logspam-watch on mwlog1002 isn’t coming up yet [13:01:15] ah there it is, nevermind [13:01:25] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [13:01:58] (03CR) 10Jgiannelos: [C: 03+1] Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951850 (owner: 10Effie Mouzeli) [13:02:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "kick off gate-and-submit while I go deploy some config changes first" [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787) (owner: 10Dreamy Jazz) [13:03:04] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [13:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:03:29] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [13:03:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815) (owner: 10Zoranzoki21) [13:04:12] (03Merged) 10jenkins-bot: [pawiki] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815) (owner: 10Zoranzoki21) [13:05:03] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]] [13:05:10] T344815: Install SandboxLink Extension in Pawiki - https://phabricator.wikimedia.org/T344815 [13:06:43] !log lucaswerkmeister-wmde@deploy1002 zoranzoki21 and lucaswerkmeister-wmde: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:56] Kizule: please test :) [13:07:50] seems to work for me, though someone™ should probably translate the word “Sandbox” soon™ [13:08:08] * Kizule testing [13:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:08:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp [13:08:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp [13:09:41] Lucas_WMDE: Is it already deployed out of mwdebug? [13:09:45] only mwdebug [13:09:50] waiting for confirmation before syncing [13:10:18] (my own test was just for curiosity, I don’t consider that sufficient for deploying unless you suddenly vanish or something ^^) [13:10:31] I'm confused because there is link to sandbox right after link to talk page, out of mwdebug. [13:10:39] did you try Ctrl+F5? [13:10:47] for me a normal F5 sometimes didn’t trigger the change (in either direction) [13:11:15] Ohhh.. They have added it manually, that was confusing me. Now I see link, yeah, this is good to go. [13:11:21] !log lucaswerkmeister-wmde@deploy1002 zoranzoki21 and lucaswerkmeister-wmde: Continuing with sync [13:11:21] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (10lwatson) Great, thanks! @jbond [13:11:28] ah ok [13:11:41] so there’s a manually added link with correct translation, and the untranslated one is new? [13:11:47] Yes [13:11:55] ah ok [13:12:05] I couldn’t distinguish the translated one from any other link ;) [13:12:12] (well, I suppose I could hover it and see which link goes to my user page. whatever) [13:12:18] syncing now [13:13:12] Okay, thanks! 
[13:13:37] and then I’ll do one of Dreamy_Jazz’ backports before continuing with your other config change fyi [13:13:44] * Lucas_WMDE checks on enwiktionary in the meantime [13:14:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Can confirm there are no pages in it: https://en.wiktionary.org/wiki/Special:AllPages?namespace=104, https://en.wiktionary.org/wiki/Specia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: 10Zoranzoki21) [13:16:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [13:16:38] (03Merged) 10jenkins-bot: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787) (owner: 10Dreamy Jazz) [13:16:49] !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [13:17:04] dangit, the backport merged just a moment before I could `scap backport` it ^^ [13:17:06] ah well [13:17:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]] (duration: 12m 06s) [13:17:14] T344815: Install SandboxLink Extension in Pawiki - https://phabricator.wikimedia.org/T344815 [13:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:42] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] [13:17:47] T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: 
MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787 [13:18:00] Dreamy_Jazz: is the “remove duplicate entries” change testable on mwdebug? [13:18:07] (not deployed yet, asking in advance) [13:18:09] Yes [13:18:15] ok [13:18:29] (03PS1) 10Herron: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987) [13:18:36] (03PS1) 10Papaul: Add new kubernetes node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/951921 (https://phabricator.wikimedia.org/T342534) [13:18:46] Does take a little while to test (requires some multi-browser editing), but shouldn't take more than a few minutes [13:18:55] ok cool [13:19:17] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and dreamyjazz: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:19:24] Testing... 
[13:19:24] then please test now :) [13:19:27] * Lucas_WMDE grabs a cup of tea [13:21:37] (03CR) 10Herron: [C: 03+2] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [13:22:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:49] (03CR) 10Effie Mouzeli: [C: 03+1] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [13:23:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [13:23:39] Hmm. Testing works, but the issue doesn't appear when I'm not using mwdebug [13:23:46] hm [13:24:36] Let me try on enwiki [13:24:45] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:05] Lucas_WMDE: Can you check if that change isn't already in production on wmf.23? [13:26:08] It fails on enwiki [13:26:12] I can try [13:26:13] let me see [13:26:13] But doesn't fail on test.wikipedia.org [13:26:26] test.wikipedia.org is on wmf.23 and enwiki is on wmf.22 [13:26:39] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp [13:26:45] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [13:26:50] I'm wondering whether it being merged before the previous scap finished meant that it applied? [13:27:55] The fix for the issue wasn't made until today, so there shouldn't be a way for the issue to not apply to wmf.23. 
[13:28:17] mw1430 (random appserver) /srv/mediawiki/php-1.41.0-wmf.23/extensions/CheckUser/src/ClientHints/ClientHintsData.php doesn’t have the code yet afaict [13:28:37] Let me try again. [13:28:42] it only merged during the last few php-fpm restarts, I don’t think it should’ve gotten synced anywhere [13:29:42] Going to try another wiki [13:30:46] (03PS1) 10Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) [13:30:48] ok [13:31:39] It is also not doing as expected on test.wikidata.org [13:32:00] in that it’s not erroring even without mwdebug? [13:32:02] i.e. test.wikidata.org doesn't have the server error when not on debug [13:32:06] hmph [13:32:07] Yes [13:32:10] (03PS1) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [13:32:19] at least it’s not the other way around… [13:32:23] Ikr [13:32:35] I’d still finish this sync just so that the merged state is consistent with what’s deployed [13:32:42] Sure [13:32:59] and it failed on enwiki right? so we could still do the wmf.22 one afterwards, that one isn’t behaving unexpectedly so far IIUC [13:33:05] Yes. It failed on enwiki. [13:33:10] ok, then let’s do that [13:33:11] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and dreamyjazz: Continuing with sync [13:33:28] but let’s remove some variables by not having concurrent gate-and-submits ^^ [13:33:38] Sure. [13:33:43] enwiktionary has had its Index namespace unused for 2 years, it can wait a bit longer [13:33:55] :) [13:34:00] (fyi Kizule – I might not do that one today) [13:34:15] I'm happy to move my config change to a later window if desired. [13:34:27] If that makes room for the other change. 
[13:34:40] well, I’ll do the wmf.22 change and then see what else there’s time for [13:34:42] jouncebot: next [13:34:42] In 1 hour(s) and 25 minute(s): Phabricator to Phorge migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1500) [13:34:46] ooooooooooh [13:34:50] Ikr [13:34:53] (but until then we could theoretically overrun the window a bit) [13:35:31] :+1 [13:35:40] (y) [13:37:04] Well, I would love to get my patch deployed in this window, since it's quick and easy one, but I'm hoping that we will be able to have all scheduled patches deployed. :) [13:37:47] (03PS1) 10Jbond: idp: drop superfluous permissions [puppet] - 10https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581) [13:38:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [13:38:39] I've also re-tested the error locally and confirmed that without the fix (even on the master branch) the server still errors out. 
[13:38:55] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] (duration: 21m 12s) [13:38:58] So no idea why it was fixed on non-mwdebug on wmf.23 [13:39:00] T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787 [13:39:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:39:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787) (owner: 10Dreamy Jazz) [13:45:25] (03CR) 10Jbond: "LGTM minor nit optimisation inline" [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:46:07] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [13:48:10] (03CR) 10Herron: [C: 03+1] icinga: Add notification when purging nagios resources [puppet] - 10https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: 10Andrea Denisse) [13:48:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp [13:48:24] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - 
https://phabricator.wikimedia.org/T342678 (10Jclark-ctr) [13:48:30] (03PS1) 10Gmodena: Remove rc1.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) [13:48:36] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (10Jclark-ctr) 05Open→03Resolved [13:50:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2004.codfw.wmnet with OS bookworm [13:51:58] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (10Jclark-ctr) [13:52:04] (03CR) 10Jbond: [C: 03+2] idp: drop superfluous permissions [puppet] - 10https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [13:52:12] (03PS1) 10Majavah: P:terraform: don't serve BUSL licensed Terraform versions [puppet] - 10https://gerrit.wikimedia.org/r/951934 [13:52:20] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (10Jclark-ctr) 05Open→03Resolved [13:52:26] (03CR) 10Jbond: [C: 03+2] idp: drop superfluous permissions [puppet] - 10https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [13:52:30] (03PS2) 10Jbond: idp: drop superfluous permissions [puppet] - 10https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) [13:52:48] (03Merged) 10jenkins-bot: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787) (owner: 10Dreamy Jazz) [13:52:51] (03PS3) 10Jbond: idp: add datacenter-ops to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/951903 
(https://phabricator.wikimedia.org/T341581) [13:53:11] (03CR) 10Papaul: [C: 03+2] Add new kubernetes node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/951921 (https://phabricator.wikimedia.org/T342534) (owner: 10Papaul) [13:53:18] (03PS1) 10Jgiannelos: tegola debug: Change schedule of eqiad cronjobs temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/951936 [13:53:19] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] [13:53:23] T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787 [13:53:57] (03CR) 10Andrea Denisse: [C: 03+2] icinga: Add notification when purging nagios resources [puppet] - 10https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: 10Andrea Denisse) [13:54:50] !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:54:57] Testing now [13:55:05] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340155 (10Jclark-ctr) 05Open→03Resolved [13:55:12] ok [13:56:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test-eqiad cluster: Reboot kafka nodes [13:56:49] For some reason the same thing has happened. 
[13:56:51] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342693 (10Jclark-ctr) 05Open→03Resolved [13:57:06] o_O [13:57:23] debug mode is definitely off [13:58:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [13:58:09] Literally the same request in the console (changing only the revision ID) now works on non-debug servers [13:58:13] did you check in dev tools too? [13:58:14] huh [13:58:25] (I assume console means outside the browser and thus renders my question pointless) [13:58:34] Using the browser console [13:58:45] I was wondering if the extension is maybe buggy and always adding the header for some reason [13:58:47] I could try a non-browser console, but I'm using the fetch command [13:58:50] maybe check in the network panel? [13:59:27] Actually that might be it [13:59:50] x-wikimedia-debug is set to backend=mwdebug1001.eqiad.wmnet when the extension has the debug mode off [14:00:12] This is still the case even after a Ctr + F5 [14:00:16] *Ctrl [14:00:17] that sounds like it shouldn’t happen [14:00:22] Ikr [14:00:22] and doesn’t happen on my end [14:00:27] firefox or chrome? [14:00:30] Firefox [14:00:31] (I’m on ff) [14:00:32] hm ok [14:00:42] However, the fix works as intended [14:00:46] yeah, good to sync I assume [14:00:50] Yes [14:00:53] !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Continuing with sync [14:01:01] and then probably a phab task for the extension and/or firefox being buggy? 
[14:01:09] jouncebot: now [14:01:09] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [14:01:15] I will investigate further [14:01:35] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [14:01:39] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [14:01:39] Oh. I've realised what has happened [14:01:42] I think I’ll stop deploying after this change and not overrun the window too much [14:01:47] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF) 05Resolved→03Open Hi All - Reopening to confirm that my the SQL Lab role in superset has been applied correctly... [14:01:50] to leave more of a break before the big phab move [14:01:54] yes? [14:02:02] When copying the request from chrome as a "fetch" it also copies the headers [14:02:08] I had not noticed this [14:02:11] ahhhh yes [14:02:20] My mistake then. Apologies. [14:02:32] ok, mystery solved then \o/ [14:02:33] phew [14:03:47] I've moved my config change to the next window. [14:03:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:04:30] Should I remove the config change from the current window or is there a way to indicate it wasn't done due to time constraints? [14:04:51] Dreamy_Jazz: I wouldn’t usually bother updating the finished window tbh [14:05:00] Okay. Thanks. 
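Note on the mix-up resolved at 14:02: a request copied from the browser as a "fetch" carries the headers of the original request, so an x-wikimedia-debug value such as backend=mwdebug1001.eqiad.wmnet travels with it and the replayed request still lands on a debug server. A minimal sketch of guarding against that, assuming the copied headers are available as a plain dictionary; the helper names are hypothetical and not part of any Wikimedia tooling:

```python
# Hypothetical helpers (illustration only, not Wikimedia tooling): detect and
# drop the debug routing header from copied request headers before replaying.
DEBUG_HEADER = "x-wikimedia-debug"  # header name as quoted in the discussion


def routes_to_debug_backend(headers: dict[str, str]) -> bool:
    """True if the headers would still route the request to an mwdebug backend."""
    return any(name.lower() == DEBUG_HEADER for name in headers)


def strip_debug_header(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the headers without the debug header (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != DEBUG_HEADER}


# Example headers as they might look after "copy as fetch" (values assumed).
copied = {
    "Accept": "application/json",
    "X-Wikimedia-Debug": "backend=mwdebug1001.eqiad.wmnet",
}
cleaned = strip_debug_header(copied)
print(routes_to_debug_backend(copied), routes_to_debug_backend(cleaned))
```

Checking the replayed request's headers in the network panel, as suggested at 13:58:50, gives the same answer without any code.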
[14:05:03] if someone wants to know whether something was deployed or not they should look at gerrit or SAL [14:05:17] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (10Jclark-ctr) [14:05:27] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (10Jclark-ctr) 05Open→03Resolved [14:05:47] And apologies to Kizule for delaying their change being made by not noticing the debug header being included in the console request. [14:05:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2053 - pt1979@cumin2002" [14:06:00] No problem, I'm moving my patch to another window as well. :) [14:06:30] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] (duration: 13m 10s) [14:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:38] T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787 [14:06:44] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:44] !log UTC afternoon backport+config window 
done [14:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:49] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:06] Thanks for the deploy! [14:07:21] From me as well, see you later! [14:07:38] see you! [14:08:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:25] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (10Jclark-ctr) 05Open→03Resolved [14:08:44] (03PS3) 10JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) [14:09:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:11:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:15:44] (03CR) 10Stevemunene: [C: 03+2] switch an-worker[17-48] to reuse-analytics-hadoop recipe [puppet] - 10https://gerrit.wikimedia.org/r/951458 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) [14:15:46] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344127 (10Jclark-ctr) Replaced failed cable [14:16:44] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:50] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1117.eqiad.wmnet with OS bullseye [14:18:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344127 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:18:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [14:19:18] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Include admin service secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [14:20:09] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:21:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [14:22:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet [14:22:23] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T344394 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced bower [14:22:40] (03PS5) 10Muehlenhoff: Make nftables::service types more compatible [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) [14:23:36] !log pool kartotherian in codfw for testing T344324 [14:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:41] T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [14:23:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:23:47] !log akosiaris@cumin1001 conftool 
action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:24:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:11] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:26:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet [14:26:24] !log update to HAProxy 2.7.10 in cp4052 and cp5032 - T344047 [14:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:57] (03Merged) 10jenkins-bot: admin_ng: Include admin service secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [14:27:19] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:17] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:54] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4052.*,cp5032.*} and A:cp [14:30:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10Jclark-ctr) 05Open→03Resolved Relabled Servers [14:31:16] 10SRE, 
10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10Clement_Goubert) Thank you, sorry for the out-of-order operation [14:31:33] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:31:35] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:31:47] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:31:57] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:13] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:15] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:16] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:32:21] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:34] !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [14:32:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [14:33:33] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:33:47] we know about the maps hosts [14:34:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4052.*,cp5032.*} and A:cp [14:34:28] !log depool again kartotherian in codfw for testing T344324 [14:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:32] T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [14:34:35] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:35:56] (03CR) 10Joal: [C: 03+1] "I guess we'll can deploy this safely as no more producer uses this stream, right?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [14:36:45] (03CR) 10BCornwall: [C: 03+2] sre.cdn.roll-reboot: Reduce min_grace_sleep to 300 [cookbooks] - 10https://gerrit.wikimedia.org/r/951196 (owner: 10BCornwall) [14:36:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [14:37:05] (03PS6) 10Hnowlan: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [14:37:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet [14:38:33] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:01] (03CR) 10Gmodena: Remove rc1.mediawiki.page_content_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [14:41:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet [14:43:43] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.761 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:44:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [14:44:35] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:45:03] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:45:47] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.168 second 
response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:45:57] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 6.672 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:46:13] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:47:28] (03PS2) 10BBlack: varnish: parameterize fe cache mem reservation [puppet] - 10https://gerrit.wikimedia.org/r/849633 [14:47:30] (03PS1) 10BBlack: esams: experimental frontend memory settings [puppet] - 10https://gerrit.wikimedia.org/r/951949 [14:47:53] (03CR) 10Jbond: [C: 03+1] "lgtm but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:47:57] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:48:25] (03PS7) 10Hnowlan: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [14:48:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet [14:49:06] (03CR) 10CI reject: [V: 04-1] varnish: parameterize fe cache mem reservation [puppet] - 10https://gerrit.wikimedia.org/r/849633 (owner: 10BBlack) [14:50:01] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10jbond) > Reopening you say that the role has been applied correctly, Is there some further action required? 
[14:50:12] (03PS1) 10JMeybohm: Remove admin secrets from service secrets [labs/private] - 10https://gerrit.wikimedia.org/r/951951 (https://phabricator.wikimedia.org/T297417) [14:50:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-launcher1002.eqiad.wmnet [14:50:34] (03PS1) 10JMeybohm: deployment_server::services: Drop dummy admin services [puppet] - 10https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417) [14:50:57] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove admin secrets from service secrets [labs/private] - 10https://gerrit.wikimedia.org/r/951951 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [14:53:47] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42996/console" [puppet] - 10https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [14:54:14] (03Abandoned) 10Fabfur: haproxy: sanitize eventual duplicate content-length header [puppet] - 10https://gerrit.wikimedia.org/r/951832 (https://phabricator.wikimedia.org/T344047) (owner: 10Fabfur) [14:54:17] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server::services: Drop dummy admin services [puppet] - 10https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417) (owner: 10JMeybohm) [14:55:25] (03PS3) 10BBlack: varnish: parameterize fe cache mem reservation [puppet] - 10https://gerrit.wikimedia.org/r/849633 [14:55:27] (03PS2) 10BBlack: esams: experimental frontend memory settings [puppet] - 10https://gerrit.wikimedia.org/r/951949 [14:55:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10nskaggs) Any update on status from Dell on getting this hardware operational? Are we still waiting on the correct controller cards? 
[14:56:08] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) [14:56:35] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:57:14] !log deploy codfw tegola-vector-tiles with high CPU limits to rule out a hunch. T344324 [14:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:18] T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [14:57:54] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1117.eqiad.wmnet with OS bullseye [14:58:24] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:58:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4004.wikimedia.org [14:58:57] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:59:04] (03PS2) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) [14:59:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr) [14:59:17] !log pool kartotherian in codfw for testing T344324 [14:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:29] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:59:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2004.codfw.wmnet with OS bookworm [15:00:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-launcher1002.eqiad.wmnet [15:00:04] brennen: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator to Phorge migration deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1500). [15:00:20] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF) Sorry typo above >>! In T344257#9113615, @OSefu-WMF wrote: > Hi All - Reopening to confirm that my the SQL Lab rol... [15:00:54] o/ [15:01:01] o/ [15:01:03] hype [15:01:17] brennen: marostegui: I have phab backups on both datacenters; they finished correctly and are currently being compressed so they can be recovered quickly [15:01:23] cool [15:01:30] brennen: please let me know before putting phab in RO [15:01:49] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) Thanks @OSefu-WMF - Could you try running a query again now please? I made a change recently (https://gerrit.wikimed... [15:01:53] I will be here just in standby mode [15:02:40] marostegui: we're downtiming the service now, will probably just stop httpd and phd, then let you know.
[15:02:48] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: Switch Phabricator to Phorge [15:02:58] brennen: excellent, thanks [15:03:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: Switch Phabricator to Phorge [15:04:13] (03PS3) 10Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) [15:04:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [15:04:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4004.wikimedia.org [15:05:09] (03CR) 10FNegri: [C: 03+1] "Agreed. Hopefully something will come out of the OpenTF initiative:" [puppet] - 10https://gerrit.wikimedia.org/r/951934 (owner: 10Majavah) [15:06:07] (03CR) 10Klausman: prometheus: Add recording rules for istio traffic on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [15:06:39] PROBLEM - Host an-druid1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:45] RECOVERY - Host an-druid1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [15:07:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [15:07:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [15:08:41] Should I change topic? [15:08:57] marostegui: phab httpd & phd are down [15:09:01] ok [15:09:02] one sec [15:09:12] jynus: might be good to mention phab maint [15:09:20] hopefully this will be brief. :) [15:09:24] (things i should not say aloud.) [15:09:36] brennen: replication stopped [15:09:55] marostegui: good to proceed with migration? 
[15:09:59] brennen: yep [15:10:03] cool, here goes [15:10:07] ^ [15:10:19] that will also CC urandom and herron [15:10:35] !log brennen@deploy1002 Started deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) [15:11:09] ack [15:11:10] !log brennen@deploy1002 Finished deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) (duration: 00m 34s) [15:11:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [15:12:44] marostegui: ok, phab is back up, migrations should i believe have happened, lemme confirm that... [15:12:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [15:13:02] * brett cheers from the sidelines [15:13:22] My changes to a task made minutes before seem to still be there [15:13:23] brennen: sure, we can leave replication stopped till tomorrow on the "just in-case host" [15:13:37] That shouldn't be an issue [15:13:49] oh, wow, that was fast [15:14:01] yeah, scap deploy is pretty quick [15:14:04] (03PS2) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [15:14:10] I thought it was going to be like a multiple-hours db migration [15:14:30] (03CR) 10CI reject: [V: 04-1] New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [15:14:35] Well, the font changed. [15:14:35] So it must be new code. 
:-) [15:16:10] (03CR) 10BBlack: "PCC on a few nodes: https://puppet-compiler.wmflabs.org/output/951949/42998/" [puppet] - 10https://gerrit.wikimedia.org/r/951949 (owner: 10BBlack) [15:16:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [15:17:16] One thing that might be different is the Activity pane doesn't seem expanded by default. I can't remember 100% whether it was expanded by default before, but the empty space looks odd. [15:17:45] Dreamy_Jazz: It was expanded by default before, yeah [15:17:49] At least it'd show stuff [15:18:13] It does still show stuff if you click on one of the tabs [15:18:17] yeah [15:18:35] I expect minor inconveniences to show up, it always happens on upgrade [15:18:44] ^ [15:18:50] but as long as it is that, it is not a big issue [15:19:02] No problem with it being like this. Just wanted to report it. [15:19:45] thanks, noted. we think this is an upstream issue that should be fixed with future updates. [15:19:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [15:19:56] 👍 [15:20:25] marostegui: leaving replication stopped seems sensible. i'll be around through US workday tomorrow. hoping we don't identify anything that needs a rollback though. [15:20:35] brennen: is the maintenance then finished, other than monitoring? [15:21:04] !log brennen@deploy1002 Started deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) [15:21:09] T333885: Migrate phabricator.wikimedia.org to Phorge as upstream - https://phabricator.wikimedia.org/T333885 [15:21:09] brennen: No problem, I will leave it stopped until you give me green light [15:21:21] jynus: we're updating the fallback machine and then this should just be monitoring. 
[15:21:27] (03PS1) 10Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) [15:21:37] I see, then I will wait for that to finish [15:21:43] !log brennen@deploy1002 Finished deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) (duration: 00m 38s) [15:22:15] (03CR) 10CI reject: [V: 04-1] tegola: bump image and cpu limits on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:23:26] (03CR) 10Vgutierrez: [C: 03+1] varnish: parameterize fe cache mem reservation [puppet] - 10https://gerrit.wikimedia.org/r/849633 (owner: 10BBlack) [15:24:39] (03PS2) 10Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) [15:25:30] (03CR) 10CI reject: [V: 04-1] tegola: bump image and cpu limits on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:25:39] jynus: should be good [15:25:40] (03PS3) 10Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) [15:28:05] (03CR) 10Eevans: [C: 03+2] Update kask container image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy) [15:28:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] tegola: bump image and cpu limits on codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:29:37] (03CR) 10Effie Mouzeli: tegola: bump image and cpu limits on codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 
(https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:29:46] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:29:48] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [15:30:09] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [15:30:34] (03PS1) 10Gmodena: data-engineering: flink: alert when TM is missing for 5m. [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) [15:31:45] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams sandbox - ayounsi@cumin1001" [15:31:54] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply [15:31:58] (03PS4) 10Effie Mouzeli: tegola: bump cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) [15:32:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [15:32:17] (03CR) 10Vgutierrez: [C: 03+1] esams: experimental frontend memory settings [puppet] - 10https://gerrit.wikimedia.org/r/951949 (owner: 10BBlack) [15:32:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] tegola: bump cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:32:51] (03CR) 10Gmodena: "I suspect this was the cause of alerts fired during a maintenance restart today:" [alerts] - 10https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [15:33:07] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [15:33:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams sandbox - ayounsi@cumin1001" [15:33:19] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:23] (03CR) 10Effie Mouzeli: [C: 03+2] tegola: bump cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:34:06] (03Merged) 10jenkins-bot: tegola: bump cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [15:34:32] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply [15:35:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bookworm [15:35:43] (03PS1) 10Ebernhardson: Draft: cirrus streaming updater producer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [15:35:43] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [15:36:58] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:03] (03PS3) 10FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) [15:37:19] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:37:32] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:37:33] (03CR) 10CI reject: [V: 04-1] New files/templates for OpenStack Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [15:38:09] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [15:39:09] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [15:39:28] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:33] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [15:40:19] (03CR) 10BBlack: [C: 03+2] varnish: parameterize fe cache mem reservation [puppet] - 10https://gerrit.wikimedia.org/r/849633 (owner: 10BBlack) [15:40:24] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [15:40:27] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [15:40:29] (03CR) 10BBlack: [C: 03+2] esams: experimental frontend memory settings [puppet] - 10https://gerrit.wikimedia.org/r/951949 (owner: 10BBlack) [15:40:46] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [15:44:22] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1117.eqiad.wmnet with OS bullseye [15:44:22] (03PS2) 10Majavah: P:terraform: don't serve BUSL licensed Terraform versions [puppet] - 10https://gerrit.wikimedia.org/r/951934 [15:44:56] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [15:45:10] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:46:14] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:46:24] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:46:46] (03CR) 10Majavah: [C: 03+2] P:terraform: don't serve BUSL licensed Terraform versions [puppet] - 10https://gerrit.wikimedia.org/r/951934 (owner: 10Majavah) [15:47:22] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:47:30] PROBLEM - Maps HTTPS on maps1009 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:47:38] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:06] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.649 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:14] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:20] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:30] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:38] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:50:56] (03CR) 10Kamila Součková: [C: 03+1] helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [15:51:01] (03PS22) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [15:51:46] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.553 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:52:24] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/951962 [15:53:44] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/951962 (owner: 10Effie Mouzeli) [15:54:03] (03PS1) 10Kamila Součková: cassandra-http-gateway: remove typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/951964 [15:54:07] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [15:54:11] (03PS2) 10JMeybohm: deployment_server: Add jaeger user to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) [15:54:31] (03Merged) 10jenkins-bot: tegola-vector-tiles: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/951962 (owner: 10Effie Mouzeli) [15:55:33] (03PS2) 10Kamila Součková: cassandra-http-gateway: remove typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/951964 [15:55:34] !log pooled codfw kartotherian/maps [15:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42999/console" [puppet] - 10https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [15:57:09] !log cp3066 - varnish-frontend-restart for new memory params experiment [15:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:51] (03CR) 10Kamila Součková: "here, have some cleanup in return for the other review :D" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/951964 (owner: 10Kamila Součková) [15:59:30] (03CR) 10Hnowlan: [C: 03+2] helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [15:59:56] (03CR) 10Hnowlan: [C: 03+1] cassandra-http-gateway: remove typo in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/951964 (owner: 10Kamila Součková) [16:01:59] (03PS1) 10Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 [16:02:34] (03Merged) 10jenkins-bot: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [16:05:52] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:06:25] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:06:36] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:07:21] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:07:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqiad and A:cp [16:08:51] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:09:17] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqiad and A:cp [16:09:22] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:10:22] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:49] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:11:35] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (10RobH) [16:11:48] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:55] (03PS1) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/951853 [16:14:57] (03PS2) 10Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - 10https://gerrit.wikimedia.org/r/951965 [16:16:41] (03CR) 10Herron: [C: 03+2] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/951853 (owner: 10Herron) [16:17:10] !log depool maps/karothertian codfw [16:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:58] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [16:19:43] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF) Unfortunately I'm still getting the same error in SQL lab. [16:24:44] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:25:19] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:27:16] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [16:35:15] !log cp3067-81 - rolling restart of varnish frontends (one at a time, 30 minute sleep between, will run for ~7.5h), for experimental cache memory settings from https://gerrit.wikimedia.org/r/c/operations/puppet/+/951949 [16:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:06] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) Oh, sorry about this @OSefu-WMF. I've just tried the same query from your screenshot above and its working for me. I... [16:37:08] (03PS1) 10Andrea Denisse: alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671) [16:37:25] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [16:37:43] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:48] (03PS1) 10Andrea Denisse: dns: Repoint alert host services from alert1001 to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671) [16:43:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2053 - pt1979@cumin2002" [16:43:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:43:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:45:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:39] !log 
pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2053.mgmt.codfw.wmnet with reboot policy FORCED [16:46:23] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10OSefu-WMF) @BTullis After login/logout I am able to view dashboards but it seems like the issues is confined to the SQL Lab fe... [16:46:35] (03CR) 10Herron: [C: 03+1] dns: Repoint alert host services from alert1001 to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671) (owner: 10Andrea Denisse) [16:46:43] (03CR) 10Herron: [C: 03+1] alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671) (owner: 10Andrea Denisse) [16:54:48] (03PS14) 10Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308) [16:56:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2053.mgmt.codfw.wmnet with reboot policy FORCED [16:57:43] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:08] (03PS3) 10Dduvall: gitlab: Support loading of local gems [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) [16:59:10] (03CR) 10Dduvall: gitlab: Support loading of local gems (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1700) [17:00:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2053'] [17:00:11] 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10thcipriani) Hrm. We get an email from the systemd timer for this, so the alert is probably not necessary. We're not very familiar with alertmanager. Can we just remove this alert? [17:02:43] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:35] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [17:03:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [17:05:04] !log set icinga downtime on wikitech-static [17:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (10ayounsi) I'm going to hijack those 2 hosts before they get decommissioned for some tests. I'll rename them ganeti-test1001/1002. 
[17:06:15] !log reboot alert2001 for a kernel upgrade [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert2001.wikimedia.org [17:07:32] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2001.wikimedia.org [17:08:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:58] (03PS1) 10Brennen Bearnes: Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 [17:10:36] jouncebot: nowandnext [17:10:36] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1700) [17:10:36] In 0 hour(s) and 49 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800) [17:10:36] In 0 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800) [17:10:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2053'] [17:11:19] (03CR) 10Effie Mouzeli: "We believe that this commit bumped our rps, thus SRE requested this revert https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-r" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:11:31] (03CR) 10Effie Mouzeli: [C: 03+1] Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:11:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" 
[extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:13:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:15:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:16:07] (03CR) 10Andrea Denisse: [C: 03+2] alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671) (owner: 10Andrea Denisse) [17:17:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:18:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [17:19:17] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [17:19:46] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2040-kubernetes2052 - pt1979@cumin2002" [17:20:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2040-kubernetes2052 - pt1979@cumin2002" [17:20:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:21:37] (03CR) 10Kosta Harlan: "I don't see how this is related to a jump in RPS. 
This patch fixed a narrow case where someone (probably manually tampering with the colle" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:22:20] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad and A:cp [17:22:40] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad and A:cp [17:23:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:07] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_eqiad and A:cp [17:23:08] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-upload_eqiad and A:cp [17:23:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2053.codfw.wmnet with OS bullseye [17:23:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye [17:24:06] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and A:cp [17:24:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and A:cp [17:24:49] brennen: per discussion in -serviceops I'd suggest not reverting... 
or if it's too late, I guess I can make a revert of the revert :) [17:25:02] i think i can -2 it so it doesn't merge [17:25:18] (03CR) 10Brennen Bearnes: [C: 04-2] Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:25:47] kostajh: and killed scap [17:26:06] thank you [17:26:42] (03Abandoned) 10Brennen Bearnes: Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/951855 (owner: 10Brennen Bearnes) [17:27:02] sorry for the runaround brennen <3 appreciate the quick response anyway [17:27:29] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [17:27:36] rzl: no worries. [17:28:04] can someone bring logmsgbot back to life please? :)) [17:28:07] (should be `tcpircbot-logmsgbot` on alert1001) [17:28:41] urbanecm: there's some maintenance ongoing on alert hosts, it should return soonish AIUI [17:28:42] denisse: ^ fyi [17:28:49] ack, ty [17:28:58] i am going to have to go afk for a bit here, so if further deploy followup does turn out to be needed please coordinate with train folks for the upcoming window. (i think that's dduvall today.) [17:29:07] brennen: ack [17:29:25] urbanecm: Yes, we're doing some maintenance on those host. Apologies for the downtime caused! 
[17:29:47] np, thought it just disconnected randomly (ircbots have a tendency of doing that :D ) [17:29:56] (03CR) 10Andrea Denisse: [C: 03+2] dns: Repoint alert host services from alert1001 to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671) (owner: 10Andrea Denisse) [17:31:26] !log failing over alert1001 to alert2001 [17:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Papaul) @nskaggs please see last update @https://phabricator.wikimedia.org/T339131 [17:47:25] !log make alert2001 the active host [17:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:11] (AlertManagerNotificationFail) firing: AlertManager is failing to deliver notifications - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DAlertManagerNotificationFail [17:49:14] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:49:18] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:49:23] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:27] (RedisMemoryFull) firing: (6) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:49:31] (AlertManagerNotificationFail) resolved: AlertManager is failing to deliver notifications - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DAlertManagerNotificationFail [17:50:52] PROBLEM - Check systemd state on alert2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-icinga-state.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:46] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert1001.wikimedia.org [17:51:47] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert1001.wikimedia.org [17:52:56] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:53:37] 10SRE: Wiki store page indexing issues detected - https://phabricator.wikimedia.org/T344844 (10SHust) [17:57:43] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:58:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:59:18] (03PS1) 10Papaul: Add new kubernetes node to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/951975 (https://phabricator.wikimedia.org/T342534) [17:59:48] (03PS1) 10Andrea Denisse: Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/951856 [18:00:06] dduvall and dancy: Your horoscope predicts another 
unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800). [18:00:06] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800). [18:00:10] (03CR) 10Papaul: [C: 03+2] Add new kubernetes node to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/951975 (https://phabricator.wikimedia.org/T342534) (owner: 10Papaul) [18:02:00] (03CR) 10Andrea Denisse: [C: 03+2] Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/951856 (owner: 10Andrea Denisse) [18:02:06] (03CR) 10Herron: [C: 03+1] Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/951856 (owner: 10Andrea Denisse) [18:02:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:03:48] !log failing over from alert2001 to alert1001 [18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:06:22] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725) [18:06:24] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:07:03] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725) (owner: 10TrainBranchBot) [18:08:44] (03PS1) 10Andrea Denisse: Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/951857 [18:09:29] (03CR) 10Andrea Denisse: [C: 03+2] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/951857 (owner: 10Andrea Denisse) [18:09:31] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/951857 (owner: 10Andrea Denisse) [18:09:54] !log updating DNS to point to alert1001 [18:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:52] (03CR) 10Herron: [C: 03+1] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/951857 (owner: 10Andrea Denisse) [18:13:38] !log making alert1001 the primary alert host [18:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [18:15:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:17:17] !log alert hosts maintenance finished [18:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:29] !log re-enabled icinga meta-monitoring on wikitech-static [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.23 refs T343725 (duration: 06m 01s) [18:19:56] T343725: 1.41.0-wmf.23 
deployment blockers - https://phabricator.wikimedia.org/T343725 [18:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:26:03] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:38:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2053.codfw.wmnet with OS bullseye [18:38:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... 
[18:40:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:45:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2053.codfw.wmnet with OS bullseye [18:45:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye [18:45:59] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2001.codfw.wmnet [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:51:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:55:43] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:55:50] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@33de526]: (no justification provided) [18:56:11] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@33de526]: (no justification provided) (duration: 00m 20s) [18:57:21] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cassandra-dev2001.codfw.wmnet [19:00:04] (03PS1) 10Eevans: cassandra-dev: monitor tls on port 7000 [puppet] - 10https://gerrit.wikimedia.org/r/951980 [19:00:44] (03CR) 10Eevans: [C: 03+2] cassandra-dev: monitor tls on port 7000 [puppet] - 10https://gerrit.wikimedia.org/r/951980 (owner: 10Eevans) [19:06:43] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2002.codfw.wmnet [19:09:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2053.codfw.wmnet with reason: host reimage [19:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:12:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2053.codfw.wmnet with reason: host reimage [19:13:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cassandra-dev2002.codfw.wmnet [19:14:35] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2003.codfw.wmnet [19:20:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2052.mgmt.codfw.wmnet with reboot policy FORCED [19:20:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:21:01] !log 
eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cassandra-dev2003.codfw.wmnet [19:28:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:29:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:31:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2052.mgmt.codfw.wmnet with reboot policy FORCED [19:31:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:31:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2053.codfw.wmnet with OS bullseye [19:31:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye completed: - kubernetes2053 (**PASS*... 
[19:31:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt DNS for kubernetes2051 - pt1979@cumin2002" [19:31:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:32:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2052'] [19:32:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt DNS for kubernetes2051 - pt1979@cumin2002" [19:32:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED [19:35:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet [19:35:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [19:43:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet [19:43:24] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet [19:45:19] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2052'] [19:46:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED [19:47:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2052.codfw.wmnet with OS bullseye [19:47:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2052.codfw.wmnet with OS bullseye [19:48:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED [19:49:10] 10SRE, 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (10Urbanecm_WMF) Adding some SRE tags. [19:49:57] 10SRE, 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (10RhinosF1) [19:50:03] urbanecm: conflicted with you. Oops. 
[19:52:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet [19:52:21] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet [19:53:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2050.mgmt.codfw.wmnet with reboot policy FORCED [19:54:12] RhinosF1: no worries :) [19:55:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2049.mgmt.codfw.wmnet with reboot policy FORCED [19:59:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T2000). [20:00:04] hmonroy, Dreamy_Jazz, and kizule: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] \o [20:00:25] \o [20:00:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2019.codfw.wmnet [20:00:41] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet [20:00:58] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:03:22] (03PS4) 10HMonroy: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) [20:03:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2050.mgmt.codfw.wmnet with reboot policy FORCED [20:04:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:13] Is anyone around for the backport window? [20:05:15] not sure, I can try deploying but it would be my first time doing it [20:05:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2049.mgmt.codfw.wmnet with reboot policy FORCED [20:06:13] hmonroy: https://deploy-commands.toolforge.org/ might be useful but I was talking to a deployer a moment ago [20:06:18] Please hold a few minutes [20:06:42] i can deploy today :) [20:06:51] :D [20:06:59] hi Dreamy_Jazz ! [20:07:05] Hi there [20:07:11] hmonroy: wanna hop on a call and try doing the deployment for your patch? :) [20:07:25] urbanecm: yes! [20:08:04] hmonroy: pm sent with a link [20:08:31] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:09:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet [20:09:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2024.codfw.wmnet [20:10:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:11:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2051.codfw.wmnet with OS bullseye [20:11:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2051.codfw.wmnet with OS bullseye [20:11:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2052.codfw.wmnet with reason: host reimage [20:11:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [20:12:28] (03Merged) 10jenkins-bot: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [20:12:58] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]] [20:13:03] T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754 
[20:14:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2052.codfw.wmnet with reason: host reimage [20:14:32] !log hmonroy@deploy1002 hmonroy: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:15:41] 10SRE: store.wikimedia.org page indexing issues detected by google search console - https://phabricator.wikimedia.org/T344844 (10Peachey88) [20:15:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2050.codfw.wmnet with OS bullseye [20:15:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2050.codfw.wmnet with OS bullseye [20:15:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2049.codfw.wmnet with OS bullseye [20:15:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2049.codfw.wmnet with OS bullseye [20:17:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2048.mgmt.codfw.wmnet with reboot policy FORCED [20:17:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2024.codfw.wmnet [20:18:01] !log hmonroy@deploy1002 hmonroy: Continuing with sync [20:18:20] 10SRE, 10Search-Console-access-request: store.wikimedia.org page indexing issues detected by google search console - https://phabricator.wikimedia.org/T344844 (10RhinosF1) Hi, I don't believe SRE 
maintain search console access. I added the main tag and I think @SCherukuwada is the POC [20:23:22] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]] (duration: 10m 24s) [20:23:28] T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754 [20:24:32] Dreamy_Jazz: you're patch will be deploy next :) [20:24:39] Thanks! [20:24:55] (03PS2) 10HMonroy: clienthints: Lower API max lag time to 5 minutes on group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:25:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:12] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet [20:25:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:25:44] I don't have any easy way to test this other than waiting 5 minutes from an edit to see if the request fails. [20:26:12] (03Merged) 10jenkins-bot: clienthints: Lower API max lag time to 5 minutes on group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: 10Dreamy Jazz) [20:26:19] If you would like me to do that, I can do so. [20:26:39] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]] [20:26:43] Actually, I can make that testing edit now. 
[20:26:43] T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797 [20:26:47] That should reduce the time. [20:27:16] Dreamy_Jazz: we can proceed and let us to revert if anything fails [20:28:04] Sure. I've made that testing edit now and already set a timer. [20:28:10] !log hmonroy@deploy1002 dreamyjazz and hmonroy: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:28:26] !log hmonroy@deploy1002 dreamyjazz and hmonroy: Continuing with sync [20:30:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2048.mgmt.codfw.wmnet with reboot policy FORCED [20:30:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:32:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet [20:32:41] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet [20:32:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:32:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
kubernetes2052.codfw.wmnet with OS bullseye [20:32:53] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2052.codfw.wmnet with OS bullseye completed: - kubernetes2052 (**PASS*... [20:33:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2048'] [20:33:48] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]] (duration: 07m 09s) [20:33:52] T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797 [20:33:58] Works as expected. [20:34:21] Dreamy_Jazz: Awesome! It's in production now :) [20:34:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2051.codfw.wmnet with reason: host reimage [20:34:25] Thanks! [20:34:41] Dreamy_Jazz: NP! 
[20:35:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2047.mgmt.codfw.wmnet with reboot policy FORCED [20:37:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2051.codfw.wmnet with reason: host reimage [20:39:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2050.codfw.wmnet with reason: host reimage [20:40:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2049.codfw.wmnet with reason: host reimage [20:41:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet [20:41:05] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet [20:42:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2050.codfw.wmnet with reason: host reimage [20:45:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2048'] [20:45:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2049.codfw.wmnet with reason: host reimage [20:45:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2047.mgmt.codfw.wmnet with reboot policy FORCED [20:45:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2047'] [20:46:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2048.codfw.wmnet with OS bullseye [20:46:51] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye [20:48:54] !log eevans@cumin1001 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2020.codfw.wmnet [20:48:57] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet [20:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:52:00] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:27] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:54:02] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:54:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:54:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2051.codfw.wmnet with OS bullseye [20:54:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2051.codfw.wmnet with OS bullseye completed: - kubernetes2051 (**PASS*...
[20:55:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2046.mgmt.codfw.wmnet with reboot policy FORCED [20:56:40] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet [20:57:21] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2025.codfw.wmnet [20:58:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2047'] [20:58:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2047.codfw.wmnet with OS bullseye [20:58:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2047.codfw.wmnet with OS bullseye [20:58:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:01:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:02:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:02:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2050.codfw.wmnet with OS bullseye [21:02:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2050.codfw.wmnet with OS bullseye completed: - kubernetes2050 (**PASS*... [21:02:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:02:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2049.codfw.wmnet with OS bullseye [21:02:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2049.codfw.wmnet with OS bullseye completed: - kubernetes2049 (**WARN*... [21:04:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2046.mgmt.codfw.wmnet with reboot policy FORCED [21:05:04] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 3679 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:05:19] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2046'] [21:05:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2045.mgmt.codfw.wmnet with reboot policy FORCED [21:05:41] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2025.codfw.wmnet [21:05:47] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export/downtime test [21:05:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export/downtime test [21:06:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2044.mgmt.codfw.wmnet with reboot policy FORCED [21:07:42] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet [21:15:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2045.mgmt.codfw.wmnet with reboot policy FORCED [21:15:33] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2012.codfw.wmnet [21:15:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet [21:17:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2044.mgmt.codfw.wmnet with reboot policy FORCED [21:19:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2047.codfw.wmnet with reason: host reimage [21:20:58] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2047.codfw.wmnet with reason: host reimage [21:23:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet [21:23:40] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet [21:25:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:30:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull 
[21:32:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet [21:32:26] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet [21:38:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:40:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2023.codfw.wmnet [21:40:48] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet [21:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:44:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2046'] [21:44:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:44:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2047.codfw.wmnet with OS bullseye [21:44:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2047.codfw.wmnet with OS bullseye completed: - kubernetes2047 (**PASS*... 
[21:49:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet [21:49:03] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2027.codfw.wmnet [21:49:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2046.codfw.wmnet with OS bullseye [21:49:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2046.codfw.wmnet with OS bullseye [21:50:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2045'] [21:50:57] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:51:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2044'] [21:51:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2044'] [21:51:59] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2044'] [21:52:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2043.mgmt.codfw.wmnet with reboot policy FORCED [21:55:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:56:02] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:57:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2027.codfw.wmnet [22:00:32] (03PS1) 10Urbanecm: mediawiki::mcrouter_wancache: add wikifunctions entry [puppet] - 10https://gerrit.wikimedia.org/r/952000 (https://phabricator.wikimedia.org/T344147) [22:00:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:04:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) Hello! I have been issued a new laptop, and now I'm not sure how to set up production access once again. Do I need to provide a new ssh public k... [22:04:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2045'] [22:04:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2044'] [22:04:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2043.mgmt.codfw.wmnet with reboot policy FORCED [22:04:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and A:cp [22:05:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2045.codfw.wmnet with OS bullseye [22:05:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2045.codfw.wmnet with OS bullseye [22:06:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
kubernetes2048.codfw.wmnet with OS bullseye [22:07:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye executed with errors: - kubernetes20... [22:07:21] (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks again." [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [22:08:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and A:cp [22:09:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2044.codfw.wmnet with OS bullseye [22:09:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2044.codfw.wmnet with OS bullseye [22:11:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2046.codfw.wmnet with reason: host reimage [22:13:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2043'] [22:13:27] (03CR) 10Jforrester: [C: 03+1] mediawiki::mcrouter_wancache: add wikifunctions entry [puppet] - 10https://gerrit.wikimedia.org/r/952000 (https://phabricator.wikimedia.org/T344147) (owner: 10Urbanecm) [22:13:46] (03PS1) 10JHathaway: puppetserver: ensure correct ordering when using an intermediate cert [puppet] - 10https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868) [22:15:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2046.codfw.wmnet with reason: host reimage [22:17:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install 
kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) @Jhancock.wm hey looks like i have no link on kubernetes2048. Thanks ` papaul@asw-d-codfw> show interfaces descriptions ge-5/0/28 Interface Admin Link Descript... [22:19:03] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [22:19:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2042.mgmt.codfw.wmnet with reboot policy FORCED [22:20:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:24:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2043'] [22:25:54] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:25:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:26:12] (SystemdUnitFailed) firing: nginx.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2043.codfw.wmnet with OS bullseye [22:26:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2043.codfw.wmnet with OS bullseye [22:26:52] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on kubernetes2045.codfw.wmnet with reason: host reimage [22:30:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:30:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2044.codfw.wmnet with reason: host reimage [22:30:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:31:12] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2042.mgmt.codfw.wmnet with reboot policy FORCED [22:31:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:31:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2046.codfw.wmnet with OS bullseye [22:31:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2046.codfw.wmnet with OS bullseye completed: - kubernetes2046 (**PASS*... 
[22:32:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2045.codfw.wmnet with reason: host reimage [22:33:28] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2042'] [22:33:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2041.mgmt.codfw.wmnet with reboot policy FORCED [22:34:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2044.codfw.wmnet with reason: host reimage [22:35:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:44:16] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344872 (10phaultfinder) [22:44:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2041.mgmt.codfw.wmnet with reboot policy FORCED [22:46:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2042'] [22:46:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2041'] [22:47:57] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:48:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2043.codfw.wmnet with reason: host reimage [22:49:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2042.codfw.wmnet with OS bullseye [22:49:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:49:37] !log pt1979@cumin2002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2045.codfw.wmnet with OS bullseye [22:49:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2042.codfw.wmnet with OS bullseye [22:50:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2040.mgmt.codfw.wmnet with reboot policy FORCED [22:50:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:51:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2043.codfw.wmnet with reason: host reimage [22:52:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:52:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2044.codfw.wmnet with OS bullseye [22:52:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2044.codfw.wmnet with OS bullseye completed: - kubernetes2044 (**PASS*... 
[22:55:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:58:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2041'] [22:59:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2041.codfw.wmnet with OS bullseye [22:59:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2041.codfw.wmnet with OS bullseye [23:00:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [23:00:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:01:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [23:02:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2040.mgmt.codfw.wmnet with reboot policy FORCED [23:07:20] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:09:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:09:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:09:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2043.codfw.wmnet with OS bullseye [23:09:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2043.codfw.wmnet with OS bullseye completed: - kubernetes2043 (**PASS*... [23:10:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2040'] [23:10:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2042.codfw.wmnet with reason: host reimage [23:13:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2042.codfw.wmnet with reason: host reimage [23:14:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:15:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:20:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2041.codfw.wmnet with reason: host reimage [23:20:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:24:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2041.codfw.wmnet with reason: host 
reimage [23:25:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:29:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2040'] [23:29:35] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:30:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2040.codfw.wmnet with OS bullseye [23:30:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2040.codfw.wmnet with OS bullseye [23:30:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:34:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:34:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2042.codfw.wmnet with OS bullseye [23:34:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2042.codfw.wmnet with OS bullseye completed: - kubernetes2042 (**PASS*... 
[23:35:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:40:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:40:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:43:59] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [23:45:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:45:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2041.codfw.wmnet with OS bullseye [23:45:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2041.codfw.wmnet with OS bullseye completed: - kubernetes2041 (**PASS*... 
[23:45:57] (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:50:57] (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:51:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2040.codfw.wmnet with reason: host reimage [23:52:48] (03PS1) 10Dduvall: P:gitlab::runner: Do not schedule untagged jobs on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/952017 (https://phabricator.wikimedia.org/T344874) [23:54:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2040.codfw.wmnet with reason: host reimage