[00:05:41] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:15:23] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Excempt researcher from hyperkitty monthly export - https://phabricator.wikimedia.org/T385271#10530816 (10Ladsgroup) The apache config that's blocking all requests is: ` # Disable export endpoint (T282957) 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Excempt researcher from hyperkitty monthly export - https://phabricator.wikimedia.org/T385271#10530855 (10Ladsgroup) BTW this exists now, we should deploy it? https://gitlab.com/mailman/hyperkitty/-/merge_requests/389/diffs [00:38:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117993 [00:38:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117993 (owner: 10TrainBranchBot) [00:49:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117993 (owner: 10TrainBranchBot) [01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117644 (owner: 10TrainBranchBot) [01:08:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117996 [01:08:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117996 (owner: 10TrainBranchBot) [01:08:51] (03CR) 10Jdlrobson: [C:03+1] "Ship it!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117951 (https://phabricator.wikimedia.org/T385309) (owner: 10Jdrewniak) [01:20:20] (03PS1) 10BryanDavis: deployment-prep: Remove parsoid things from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1117997 (https://phabricator.wikimedia.org/T385849) [01:29:10] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117996 (owner: 10TrainBranchBot) [01:36:26] (03CR) 10Scott French: [C:03+1] "Thanks, @cwhite@wikimedia.org! Looks good, and indeed the CI diffs look like what I would expect given the combination of how the tests wo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [01:41:12] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:42:25] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:44:36] (03CR) 10Ori: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117946 (https://phabricator.wikimedia.org/T385199) (owner: 10CDanis) [01:48:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:48:38] (03PS1) 10Ori: Fix channel name for ArcLamp pipeline for PHP8 [puppet] - 10https://gerrit.wikimedia.org/r/1117998 (https://phabricator.wikimedia.org/T385199) [01:49:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirt1041.eqiad.wmnet [01:49:24] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [01:54:37] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [01:54:46] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10531057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [01:55:15] (03CR) 10Ori: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117998 (https://phabricator.wikimedia.org/T385199) (owner: 10Ori) [01:57:11] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1041.eqiad.wmnet [01:57:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:59:08] (03CR) 10Ori: [C:03+2] Fix channel name for ArcLamp pipeline for PHP8 [puppet] - 10https://gerrit.wikimedia.org/r/1117998 (https://phabricator.wikimedia.org/T385199) (owner: 10Ori) [02:00:38] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [02:00:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10531061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [02:04:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:05:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T384592)', diff saved to https://phabricator.wikimedia.org/P73354 and previous config saved to /var/cache/conftool/dbconfig/20250207-025628-marostegui.json [02:56:32] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P73355 and previous config saved to /var/cache/conftool/dbconfig/20250207-031134-marostegui.json [03:14:52] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [03:15:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10531105 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed... 
[03:26:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P73356 and previous config saved to /var/cache/conftool/dbconfig/20250207-032642-marostegui.json [03:39:13] (03PS1) 10RLazarus: mediawiki: Fix default-merging logic in _site_helpers.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) [03:41:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T384592)', diff saved to https://phabricator.wikimedia.org/P73357 and previous config saved to /var/cache/conftool/dbconfig/20250207-034149-marostegui.json [03:41:53] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [03:42:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [03:54:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:57:32] (03PS1) 10Kevin Bazira: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118004 (https://phabricator.wikimedia.org/T385771) [06:11:49] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531148 (10Marostegui) [06:13:32] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531149 
(10Marostegui) [06:14:58] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531152 (10Marostegui) [06:15:46] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531155 (10Marostegui) @KFrancis I don't see this user in the NDA spreadsheet, can you confirm whether this is signed somewhere else? Thanks. [06:22:42] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531157 (10Marostegui) @EPIC the ssh key you've provided is malformed, please provide the correct format (maybe you didn't copy/paste it entirely) [06:22:58] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531158 (10Marostegui) [06:23:36] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531159 (10Marostegui) [06:24:13] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10531160 (10Marostegui) p:05Triage→03Medium [06:27:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174 db2150', diff saved to https://phabricator.wikimedia.org/P73358 and previous config saved to /var/cache/conftool/dbconfig/20250207-062745-marostegui.json [06:28:13] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2150.codfw.wmnet [06:28:19] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1174.eqiad.wmnet [06:28:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1234 db2145', diff saved to https://phabricator.wikimedia.org/P73359 and previous config saved to /var/cache/conftool/dbconfig/20250207-062857-marostegui.json [06:29:28] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2145.codfw.wmnet [06:29:36] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for 
db1234.eqiad.wmnet [06:30:15] (03PS1) 10Marostegui: installserver: Do not format db1252 [puppet] - 10https://gerrit.wikimedia.org/r/1118005 [06:31:25] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118004 (https://phabricator.wikimedia.org/T385771) (owner: 10Kevin Bazira) [06:34:37] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1174.eqiad.wmnet [06:34:50] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2150.codfw.wmnet [06:35:21] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Index rebuild [06:35:30] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Index rebuild [06:35:42] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1234.eqiad.wmnet [06:35:58] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db1252 [puppet] - 10https://gerrit.wikimedia.org/r/1118005 (owner: 10Marostegui) [06:36:23] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2145.codfw.wmnet [06:36:47] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Index rebuild [06:36:58] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1234.eqiad.wmnet with reason: Index rebuild [06:55:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1029 to es1 master', diff saved to https://phabricator.wikimedia.org/P73360 and previous config saved to /var/cache/conftool/dbconfig/20250207-065546-root.json [06:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1027', diff saved to https://phabricator.wikimedia.org/P73361 and previous config saved to /var/cache/conftool/dbconfig/20250207-065600-marostegui.json 
[06:56:09] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es1027.eqiad.wmnet [06:57:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1026 to es2 master', diff saved to https://phabricator.wikimedia.org/P73362 and previous config saved to /var/cache/conftool/dbconfig/20250207-065700-root.json [06:57:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030', diff saved to https://phabricator.wikimedia.org/P73363 and previous config saved to /var/cache/conftool/dbconfig/20250207-065730-marostegui.json [06:57:39] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es1030.eqiad.wmnet [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250207T0700) [07:01:37] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1027.eqiad.wmnet [07:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73364 and previous config saved to /var/cache/conftool/dbconfig/20250207-070156-root.json [07:05:02] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118004 (https://phabricator.wikimedia.org/T385771) (owner: 10Kevin Bazira) [07:06:05] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1030.eqiad.wmnet [07:06:07] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118004 (https://phabricator.wikimedia.org/T385771) (owner: 10Kevin Bazira) [07:06:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73365 and previous config saved to /var/cache/conftool/dbconfig/20250207-070617-root.json [07:08:13] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 
'article-models' for release 'main' . [07:12:04] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [07:13:41] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [07:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73366 and previous config saved to /var/cache/conftool/dbconfig/20250207-071702-root.json [07:21:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73367 and previous config saved to /var/cache/conftool/dbconfig/20250207-072122-root.json [07:31:30] (03CR) 10Klausman: [C:03+2] profile/roles/...: Drop all Wikilabels classes and files [puppet] - 10https://gerrit.wikimedia.org/r/1117920 (owner: 10Klausman) [07:32:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73368 and previous config saved to /var/cache/conftool/dbconfig/20250207-073207-root.json [07:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73369 and previous config saved to /var/cache/conftool/dbconfig/20250207-073627-root.json [07:47:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73370 and previous config saved to /var/cache/conftool/dbconfig/20250207-074712-root.json [07:51:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73371 and previous config saved to /var/cache/conftool/dbconfig/20250207-075132-root.json [07:59:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250207T0800) [08:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73372 and previous config saved to /var/cache/conftool/dbconfig/20250207-080218-root.json [08:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73373 and previous config saved to /var/cache/conftool/dbconfig/20250207-080638-root.json [08:24:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:32:09] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:33] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:37:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:13:56] (03CR) 10Anzx: SITENAME, project namespace, and timezone change of Serbo-Croatian Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117960 (https://phabricator.wikimedia.org/T385833) (owner: 10Acamicamacaraca) [09:14:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [09:15:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T384592)', diff saved to https://phabricator.wikimedia.org/P73374 and previous config saved to /var/cache/conftool/dbconfig/20250207-091459-marostegui.json [09:15:03] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:27:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:29:33] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:09] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd 
unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:36:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73375 and previous config saved to /var/cache/conftool/dbconfig/20250207-093649-root.json [09:39:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73376 and previous config saved to /var/cache/conftool/dbconfig/20250207-094756-root.json [09:51:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73377 and previous config saved to /var/cache/conftool/dbconfig/20250207-095154-root.json [10:03:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73378 and previous config saved to /var/cache/conftool/dbconfig/20250207-100302-root.json [10:07:00] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7 [10:07:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73379 and previous config saved to /var/cache/conftool/dbconfig/20250207-100700-root.json [10:07:04] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: 
name=clouddb1014.eqiad.wmnet,service=s2 [10:08:03] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Rebooting clouddb1014 T384946 [10:15:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73380 and previous config saved to /var/cache/conftool/dbconfig/20250207-101559-root.json [10:18:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73381 and previous config saved to /var/cache/conftool/dbconfig/20250207-101807-root.json [10:18:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:22:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73382 and previous config saved to /var/cache/conftool/dbconfig/20250207-102205-root.json [10:24:53] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2 [10:24:56] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7 [10:30:53] !log fnegri@cumin1002 START - Cookbook sre.hosts.remove-downtime for clouddb1014.eqiad.wmnet [10:30:53] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1014.eqiad.wmnet [10:31:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73383 and previous config saved to /var/cache/conftool/dbconfig/20250207-103104-root.json [10:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73384 and previous config saved to /var/cache/conftool/dbconfig/20250207-103312-root.json [10:37:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73385 and previous config saved to /var/cache/conftool/dbconfig/20250207-103710-root.json [10:40:02] (03PS1) 10Jelto: wcqs: proxy requests to query qui to new wikikube endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) [10:42:41] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS4265007002/IPv4: Connect - asw1-b4-magru https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:54] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) (owner: 10Jelto) [10:44:03] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:03] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:46:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73386 and previous config saved to /var/cache/conftool/dbconfig/20250207-104609-root.json [10:48:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 
100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73387 and previous config saved to /var/cache/conftool/dbconfig/20250207-104818-root.json [11:01:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73388 and previous config saved to /var/cache/conftool/dbconfig/20250207-110114-root.json [11:05:31] (03Abandoned) 10Muehlenhoff: wikilabels::db: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1115070 (owner: 10Muehlenhoff) [11:06:52] (03PS1) 10Muehlenhoff: Switch ganeti1033 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1118078 [11:09:49] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1033 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1118078 (owner: 10Muehlenhoff) [11:13:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:14:40] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73389 and previous config saved to /var/cache/conftool/dbconfig/20250207-111619-root.json [11:26:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73390 and previous config saved to /var/cache/conftool/dbconfig/20250207-112624-root.json [11:28:45] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 
on clouddb1018.eqiad.wmnet with reason: Rebooting clouddb1018 T384946 [11:29:09] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 [11:29:11] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2 [11:34:05] (03PS3) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [11:35:11] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:35:19] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:35:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:35:57] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:38:30] (03CR) 10Andrew Bogott: [C:03+2] Designate: unset legacy_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/1117631 (https://phabricator.wikimedia.org/T384118) (owner: 10Andrew Bogott) [11:40:02] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:40:08] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:41:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73391 and previous config saved to /var/cache/conftool/dbconfig/20250207-114129-root.json [11:42:43] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [11:42:47] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 
[11:50:08] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ganeti1033.eqiad.wmnet [11:51:04] (03PS4) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [11:56:15] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873 (10MoritzMuehlenhoff) 03NEW [11:56:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73392 and previous config saved to /var/cache/conftool/dbconfig/20250207-115634-root.json [11:58:50] (03PS5) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [11:59:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250207T0800) [12:00:05] jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250207T1200). 
[12:03:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [12:04:39] (03PS1) 10Clément Goubert: Wmflib::Php_version: Support php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1118095 (https://phabricator.wikimedia.org/T378752) [12:05:03] (03PS6) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [12:08:36] (03PS2) 10Clément Goubert: Wmflib::Php_version: Support php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1118095 (https://phabricator.wikimedia.org/T378752) [12:09:18] (03CR) 10Ladsgroup: Stop producing Yahoo! abstract dumps [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [12:10:02] (03PS4) 10Acamicamacaraca: SITENAME, project namespace, and timezone change of Serbo-Croatian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117960 (https://phabricator.wikimedia.org/T385833) [12:10:26] (03CR) 10Acamicamacaraca: SITENAME, project namespace, and timezone change of Serbo-Croatian Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117960 (https://phabricator.wikimedia.org/T385833) (owner: 10Acamicamacaraca) [12:11:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [12:11:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73393 and previous config saved to /var/cache/conftool/dbconfig/20250207-121140-root.json [12:11:54] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118095 (https://phabricator.wikimedia.org/T378752) (owner: 10Clément Goubert) [12:19:10] (03PS7) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 
(https://phabricator.wikimedia.org/T385782) [12:21:59] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [12:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2150 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73394 and previous config saved to /var/cache/conftool/dbconfig/20250207-122645-root.json [12:27:34] (03PS1) 10Federico Ceratto: Reformat with Black [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 [12:28:48] (03PS1) 10Federico Ceratto: clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 [12:31:14] (03CR) 10Federico Ceratto: [C:04-1] "Currently work in progress" [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico Ceratto) [12:35:08] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico Ceratto) [12:37:34] (03PS1) 10Andrew Bogott: wmcs-dnsleaks: when --doublecheck is set, actually check the second run [puppet] - 10https://gerrit.wikimedia.org/r/1118106 (https://phabricator.wikimedia.org/T384118) [12:38:10] (03CR) 10Andrew Bogott: [C:03+2] wmcs-dnsleaks: when --doublecheck is set, actually check the second run [puppet] - 10https://gerrit.wikimedia.org/r/1118106 (https://phabricator.wikimedia.org/T384118) (owner: 10Andrew Bogott) [12:51:59] (03PS11) 10Clément Goubert: mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) [12:52:23] (03PS8) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [12:52:25] (03CR) 10Clément Goubert: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [12:52:37] (03CR) 10Clément Goubert: "PCC failure is due to deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert) [13:01:34] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [13:04:07] (03PS1) 10FNegri: alertmanager: enable send_resolved for WMCS emails [puppet] - 10https://gerrit.wikimedia.org/r/1118110 [13:06:13] (03CR) 10Andrew Bogott: [C:03+1] alertmanager: enable send_resolved for WMCS emails [puppet] - 10https://gerrit.wikimedia.org/r/1118110 (owner: 10FNegri) [13:09:06] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] alertmanager: enable send_resolved for WMCS emails [puppet] - 10https://gerrit.wikimedia.org/r/1118110 (owner: 10FNegri) [13:31:46] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:14] (03CR) 10Kamila Součková: [C:03+1] mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert) [13:52:03] (03CR) 10FNegri: [C:03+2] alertmanager: enable send_resolved for WMCS emails [puppet] - 10https://gerrit.wikimedia.org/r/1118110 (owner: 10FNegri) [14:02:32] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4 [14:02:35] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s6 [14:02:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: maintenance [14:03:03] !log fnegri@cumin1002 START - 
Cookbook sre.hosts.reboot-single for host clouddb1015.eqiad.wmnet [14:03:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: maintenance [14:10:41] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1015.eqiad.wmnet [14:11:10] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Rebooting clouddb1015 T384946 [14:18:49] (03PS1) 10AikoChou: admin_ng: lower memory limitranges for revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118122 [14:20:08] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s6 [14:20:11] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4 [14:20:18] !log fnegri@cumin1002 START - Cookbook sre.hosts.remove-downtime for clouddb1015.eqiad.wmnet [14:20:19] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1015.eqiad.wmnet [14:21:01] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1117998 (https://phabricator.wikimedia.org/T385199) (owner: 10Ori) [14:21:34] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:21:38] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:22:13] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Rebooting clouddb1019 T384946 [14:23:26] (03CR) 10AikoChou: "After checking the resource usage for reference models, we don't need such high memory limitranges." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1118122 (owner: 10AikoChou) [14:29:22] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: lower memory limitranges for revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118122 (owner: 10AikoChou) [14:33:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:33:21] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM, my only question is whether it makes sense to add something (fixtures?) so that we catch this in CI too. Unless I missed something t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) (owner: 10RLazarus) [14:34:15] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:37] (03PS1) 10Bking: wdqs-main: Start hosting wdqs-categories service [puppet] - 10https://gerrit.wikimedia.org/r/1118124 [14:35:28] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [14:35:31] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [14:36:34] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Rebooting clouddb1020 T384946 [14:36:49] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [14:36:51] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118124 (owner: 10Bking) [14:50:44] (03PS2) 10Bking: wdqs-main: Start hosting wdqs-categories service [puppet] - 10https://gerrit.wikimedia.org/r/1118124 [14:50:50] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [14:50:55] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [14:53:45] (03PS1) 10Andrew Bogott: Remove adminscript wmcs-puppetcertleaks [puppet] - 10https://gerrit.wikimedia.org/r/1118127 [14:53:45] (03PS1) 10Andrew Bogott: Remove a couple of cloudinfra host hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1118128 [14:53:46] (03PS1) 10Andrew Bogott: Cleanup: remove old wmcs puppetmaster frontend/backend code [puppet] - 10https://gerrit.wikimedia.org/r/1118129 [14:54:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118124 (owner: 10Bking) [14:55:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1170', diff saved to https://phabricator.wikimedia.org/P73395 and previous config saved to /var/cache/conftool/dbconfig/20250207-145547-marostegui.json [14:56:03] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1170.eqiad.wmnet [15:02:43] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1170.eqiad.wmnet [15:03:05] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Index rebuild [15:03:50] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:27] !log marostegui@cumin1002 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: maintenance [15:04:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: maintenance [15:04:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118129 (owner: 10Andrew Bogott) [15:07:41] (03CR) 10Andrew Bogott: [C:03+2] Remove adminscript wmcs-puppetcertleaks [puppet] - 10https://gerrit.wikimedia.org/r/1118127 (owner: 10Andrew Bogott) [15:07:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:47] (03CR) 10Andrew Bogott: [C:03+2] Remove a couple of cloudinfra host hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1118128 (owner: 10Andrew Bogott) [15:31:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T384592)', diff saved to https://phabricator.wikimedia.org/P73396 and previous config saved to /var/cache/conftool/dbconfig/20250207-153103-marostegui.json [15:31:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:32:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:46:11] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P73397 and previous config saved to /var/cache/conftool/dbconfig/20250207-154610-marostegui.json [15:55:20] (03CR) 10BryanDavis: "I would rather see someone merge Ifab1be8e8e7b634e8875c597b586e6d69496b476 which has been running in deployment-prep since mid-December." [puppet] - 10https://gerrit.wikimedia.org/r/1118095 (https://phabricator.wikimedia.org/T378752) (owner: 10Clément Goubert) [15:58:04] (03CR) 10JHathaway: [C:03+1] wdqs-main: Start hosting wdqs-categories service [puppet] - 10https://gerrit.wikimedia.org/r/1118124 (owner: 10Bking) [15:58:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10532058 (10Jhancock.wm) [16:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P73398 and previous config saved to /var/cache/conftool/dbconfig/20250207-160117-marostegui.json [16:02:40] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:00] (03Abandoned) 10Clément Goubert: Wmflib::Php_version: Support php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1118095 (https://phabricator.wikimedia.org/T378752) (owner: 10Clément Goubert) [16:08:01] (03PS16) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [16:08:51] (03CR) 10Dzahn: [C:03+1] "testing if the vote stays when I remove my user from the attention set. 
just a test" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [16:09:06] (03PS12) 10Clément Goubert: mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) [16:09:38] (03CR) 10Dzahn: "I am still +1 on this and once we have the domain feel free to add me back." [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [16:09:46] (03PS13) 10Clément Goubert: mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) [16:13:39] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [16:13:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10532096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [16:15:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm [16:15:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10532101 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed... 
[16:16:03] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [16:16:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10532102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [16:16:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T384592)', diff saved to https://phabricator.wikimedia.org/P73399 and previous config saved to /var/cache/conftool/dbconfig/20250207-161624-marostegui.json [16:16:27] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:16:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance [16:16:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T384592)', diff saved to https://phabricator.wikimedia.org/P73400 and previous config saved to /var/cache/conftool/dbconfig/20250207-161646-marostegui.json [16:19:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[56] - https://phabricator.wikimedia.org/T379753#10532134 (10VRiley-WMF) [16:26:10] 06SRE, 06Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331#10532148 (10CDanis) 05Open→03Declined Instead we accomplished this via requestctl. 
{T371144} [16:29:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10532161 (10phaultfinder) [16:31:33] (03CR) 10Scott French: [C:03+1] mediawiki: Fix default-merging logic in _site_helpers.tpl (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) (owner: 10RLazarus) [16:31:42] (03PS1) 10CDanis: new HIDDENPARMA release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1118146 (https://phabricator.wikimedia.org/T371144) [16:32:11] (03CR) 10CDanis: [V:03+2 C:03+2] new HIDDENPARMA release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1118146 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [16:33:01] !log cdanis@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:33:02] !log cdanis@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:33:34] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:33:36] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:33:55] pushed those in the wrong order lol [16:34:01] (03CR) 10Dzahn: [C:03+2] "@arnaudb just fyi, I actually broke puppet with this and then sukhe fixed it with https://gerrit.wikimedia.org/r/c/operations/puppet/+/111" [puppet] - 10https://gerrit.wikimedia.org/r/1117580 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [16:34:10] !log cdanis@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to 
the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:34:13] !log cdanis@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:34:43] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin1002 [16:34:44] !log cdanis@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin1002" [16:39:03] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10532181 (10Marostegui) [16:40:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10532183 (10Marostegui) [16:46:16] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:52:41] FIRING: [9x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:53:14] (03PS2) 10Federico Ceratto: clone.py, clone_test.py: Implement full DB cloning runbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 [16:53:18] oh, that's probably me [16:53:55] cdanis: finders keepers! 
[16:54:09] [on-call people around if you need an extra pair of hands] [16:54:33] yeah, it's me [17:02:41] RESOLVED: [9x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:10:43] (03PS3) 10Bking: wdqs-main: Start hosting wdqs-categories service [puppet] - 10https://gerrit.wikimedia.org/r/1118124 (https://phabricator.wikimedia.org/T385896) [17:11:09] (03CR) 10Bking: [V:03+2 C:03+2] wdqs-main: Start hosting wdqs-categories service [puppet] - 10https://gerrit.wikimedia.org/r/1118124 (https://phabricator.wikimedia.org/T385896) (owner: 10Bking) [17:11:51] (03PS2) 10Majavah: puppet_compiler: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095190 (https://phabricator.wikimedia.org/T380679) [17:12:05] (03PS2) 10Majavah: P:cumin: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095191 (https://phabricator.wikimedia.org/T380679) [17:16:28] (03CR) 10Andrew Bogott: "Right now I see deployment-cumin-3 running Bullseye, built 2024-07-12. Is that the one that needs to be gone, or that a replacement for th" [puppet] - 10https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:17:31] (03PS2) 10Andrew Bogott: Cleanup: remove old wmcs puppetmaster frontend/backend code [puppet] - 10https://gerrit.wikimedia.org/r/1118129 [17:19:00] (03CR) 10Majavah: [C:03+1] Cleanup: remove old wmcs puppetmaster frontend/backend code [puppet] - 10https://gerrit.wikimedia.org/r/1118129 (owner: 10Andrew Bogott) [17:19:26] (03CR) 10Majavah: "the replacement!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:21:43] (03PS3) 10Andrew Bogott: puppet_compiler: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095190 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:21:44] (03PS3) 10Andrew Bogott: P:cumin: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095191 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:21:44] (03PS2) 10Andrew Bogott: openstack: puppet: Drop support for .wmflabs names [puppet] - 10https://gerrit.wikimedia.org/r/1095193 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:23:15] (03PS1) 10Andrew Bogott: cloud-vps resolv.conf: remove .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) [17:24:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:24:32] (03Abandoned) 10Andrew Bogott: openstack: admin_scripts: Remove support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:24:49] Does +2 a beta config change count as a friday deploy? [17:25:20] (03CR) 10Andrew Bogott: [C:03+1] puppet_compiler: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095190 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:25:47] (03CR) 10Andrew Bogott: [C:03+1] P:cumin: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095191 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:26:21] jan_drewniak: Go for it. It will not result in an actual production deployment. 
[17:26:23] jan_drewniak: do not quote me on this, but patches that don't need to be synced to prod (only touch -labs files) at least used to be fine [17:26:35] (03CR) 10Majavah: [C:03+2] puppet_compiler: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095190 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:26:42] you _do_ need to run scap backport for the change, but it's smart enough to shortcut the sync. [17:26:49] (03CR) 10Majavah: [C:03+2] P:cumin: Drop support for .wmflabs VM names [puppet] - 10https://gerrit.wikimedia.org/r/1095191 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:26:58] (03CR) 10Andrew Bogott: [C:03+1] openstack: puppet: Drop support for .wmflabs names [puppet] - 10https://gerrit.wikimedia.org/r/1095193 (https://phabricator.wikimedia.org/T380679) (owner: 10Majavah) [17:27:07] dancy, taavi, thanks for the assurance :P [17:27:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118129 (owner: 10Andrew Bogott) [17:27:54] (03CR) 10Jdrewniak: [C:03+2] Beta: Adjust config for Web search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117951 (https://phabricator.wikimedia.org/T385309) (owner: 10Jdrewniak) [17:28:25] (03CR) 10Majavah: "Did we implement something to ensure that the tools-redis/tools-db service names keep working there?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [17:28:42] (03Merged) 10jenkins-bot: Beta: Adjust config for Web search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117951 (https://phabricator.wikimedia.org/T385309) (owner: 10Jdrewniak) [17:29:01] jan_drewniak: note that they still will need to be pulled down to the real deployment server or otherwise the next deployer on monday will be confused [17:29:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:29:43] taavi: for sure 👍 [17:29:43] (03CR) 10Andrew Bogott: [C:03+2] Cleanup: remove old wmcs puppetmaster frontend/backend code [puppet] - 10https://gerrit.wikimedia.org/r/1118129 (owner: 10Andrew Bogott) [17:30:25] FIRING: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:47] (03CR) 10Andrew Bogott: "Do services refer to those hosts with just a hostname rather than fqdn? If so then we can't merge this one yet." [puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [17:31:52] (03PS2) 10Majavah: realm: stop setting labsproject [puppet] - 10https://gerrit.wikimedia.org/r/916425 [17:32:23] (03CR) 10Majavah: "I'm pretty sure those short hostnames were the documented way to access the service for a while." 
[puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [17:32:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1026:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:35:25] RESOLVED: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:42] (03CR) 10Andrew Bogott: [C:04-2] "Need to make sure we have .wikimedia.cloud cnames in place for redis and clouddb servers before this can be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1118151 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [17:35:58] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [17:36:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add mgmt dns names for test nokia switches - cmooney@cumin1002" [17:36:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:08] (03CR) 10RLazarus: [C:03+2] "Thanks! Yeah, that might be a good idea. 
Right now we have no fixtures coverage for anything under mw.sites -- instead we use httpbb tests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) (owner: 10RLazarus) [17:37:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm [17:38:06] (03Merged) 10jenkins-bot: mediawiki: Fix default-merging logic in _site_helpers.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) (owner: 10RLazarus) [17:38:44] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [17:38:54] sukhe, jhathaway: okay with you if I push that out and keep an eye on it? it's not strictly speaking a no-op wrt the apache config, but it's a no-op wrt any parts of the config that I believe anyone cares about [17:39:38] rzl: sounds good to me [17:39:43] rzl: I haven't touch this but if you are confident, go for it [17:39:59] (meaning I can't comment on the safety but you can so +1) [17:40:01] thanks <3 I'll be around all day, if I break it I bought it [17:40:12] :D [17:41:23] (and I *am* extremely confident that any problems will show up in the first half-hour, so I'm not worried about the weekend at all) [17:42:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1025:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:44:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:49:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1022:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:52:10] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:55:53] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1118003 [17:57:03] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1118003 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:58:08] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1255 - vriley@cumin1002" [17:58:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1255 - vriley@cumin1002" [17:58:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:58:43] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1255 [17:58:53] !log rzl@deploy2002 rzl: Continuing with sync [17:59:24] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [17:59:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1255 [18:01:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1255.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:03:55] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:04:55] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1118003 (duration: 12m 54s) [18:07:11] all done [18:07:22] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1256 - vriley@cumin1002" [18:07:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1256 - vriley@cumin1002" [18:07:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:49] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1256 [18:14:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1256 [18:29:09] (03PS2) 10Dzahn: site: remove requesttracker role from host moscovium [puppet] - 10https://gerrit.wikimedia.org/r/1117598 (https://phabricator.wikimedia.org/T385777) [18:32:09] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1054.eqiad.wmnet with OS bookworm [18:33:27] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1256.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:43:49] (03CR) 10Dzahn: [C:03+2] site: remove requesttracker role from host moscovium [puppet] - 10https://gerrit.wikimedia.org/r/1117598 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [19:08:57] (03PS3) 10Dzahn: rt: remove hiera settings for requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [19:16:14] (03CR) 10Dzahn: [C:03+2] rt: remove hiera settings for requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [19:34:51] PROBLEM - Categories update lag on wdqs2021 is CRITICAL: CRITICAL - Categories lag: 12 days, 23:34:43.282314 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [19:35:03] (03CR) 10RLazarus: [C:03+2] "Opened https://phabricator.wikimedia.org/T385905." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1118003 (https://phabricator.wikimedia.org/T385228) (owner: 10RLazarus) [19:37:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:36] ^^ categories alerts are expected, taking a look now [19:37:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73401 and previous config saved to /var/cache/conftool/dbconfig/20250207-193754-root.json [19:38:36] also not sure why those categories lag alerts are going to SRE team, I'll fix that too [19:45:24] (03PS1) 10Bking: wdqs-categories: enable scrapes for jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/1118162 (https://phabricator.wikimedia.org/T385236) [19:48:08] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2018.codfw.wmnet [19:50:57] RECOVERY - Host analytics1073 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:51:01] RECOVERY - SSH on analytics1073 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:52:18] 10ops-eqiad, 06SRE, 06DC-Ops: analytics1073 is unreachable since eight days - https://phabricator.wikimedia.org/T385786#10532644 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Reseated cable and it seems to be back up and running. Closing this out for now. 
[19:53:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73402 and previous config saved to /var/cache/conftool/dbconfig/20250207-195300-root.json [19:59:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10532661 (10phaultfinder) [20:02:40] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118162 (https://phabricator.wikimedia.org/T385236) (owner: 10Bking) [20:06:08] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2019.codfw.wmnet [20:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73403 and previous config saved to /var/cache/conftool/dbconfig/20250207-200805-root.json [20:15:25] PROBLEM - Categories update lag on wdqs2022 is CRITICAL: CRITICAL - Categories lag: 142 days, 15:15:24.430999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag [20:16:06] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10532690 (10MoritzMuehlenhoff) >>! In T383723#10530006, @VRiley-WMF wrote: > Thanks! Yeah, we wouldn't need much downtime for this ganeti1044 de... 
[20:16:44] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10532691 (10KFrancis) Hi all, I checked my records, and I do have have a NDA for anything under kwa.schultz@gmail.com or user EP1C. What is this person's full name? [20:23:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73404 and previous config saved to /var/cache/conftool/dbconfig/20250207-202311-root.json [20:35:55] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS4265007002/IPv4: Active - asw1-b4-magru https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:38:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73405 and previous config saved to /var/cache/conftool/dbconfig/20250207-203816-root.json [21:04:34] (03Abandoned) 10Bartosz Dziewoński: mediawiki: Allow overwriting true config defaults with false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117922 (https://phabricator.wikimedia.org/T385228) (owner: 10Bartosz Dziewoński) [21:06:56] (03PS3) 10Bartosz Dziewoński: Reapply "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1117924 (https://phabricator.wikimedia.org/T383952) [21:16:22] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2018.codfw.wmnet [21:17:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:18:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T384592)', diff saved to https://phabricator.wikimedia.org/P73406 and 
previous config saved to /var/cache/conftool/dbconfig/20250207-211851-marostegui.json [21:18:55] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:22:28] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:28:35] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2020.codfw.wmnet [21:33:43] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2019.codfw.wmnet [21:33:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P73407 and previous config saved to /var/cache/conftool/dbconfig/20250207-213357-marostegui.json [21:37:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:38:25] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2021.codfw.wmnet [21:39:50] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1021.eqiad.wmnet [21:40:13] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1022.eqiad.wmnet [21:42:28] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:44:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10532848 
(10Jhancock.wm) i got some parts in. a disk controller card and two backplanes (one for each set of drives). I got the card installed first. i need to lookup how to even... [21:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P73408 and previous config saved to /var/cache/conftool/dbconfig/20250207-214904-marostegui.json [21:59:13] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:59:55] FIRING: MaxConntrack: Max conntrack at 90.16% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:02:13] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:04:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T384592)', diff saved to https://phabricator.wikimedia.org/P73409 and previous config saved to /var/cache/conftool/dbconfig/20250207-220411-marostegui.json [22:04:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [22:04:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [22:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T384592)', diff saved to https://phabricator.wikimedia.org/P73410 and previous config saved to /var/cache/conftool/dbconfig/20250207-220433-marostegui.json [22:09:13] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:09:55] RESOLVED: MaxConntrack: 
Max conntrack at 90.07% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:10:55] FIRING: MaxConntrack: Max conntrack at 90.17% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:14:49] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3448 MB (3% inode=98%): /tmp 3448 MB (3% inode=98%): /var/tmp 3448 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:15:11] RESOLVED: MaxConntrack: Max conntrack at 90.14% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:15:13] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 43 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:26:35] RECOVERY - BGP status on cr2-magru is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:27:49] (03PS1) 10Bvibber: Pref off use of gjl_namespace_text field until it's deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118184 (https://phabricator.wikimedia.org/T385917) [22:31:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118184 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber) [22:58:45] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2020.codfw.wmnet [23:00:15] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1021.eqiad.wmnet [23:01:24] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1022.eqiad.wmnet [23:02:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:04:58] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:06:38] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2021.codfw.wmnet [23:07:28] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs2020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:09:58] FIRING: [3x] SystemdUnitCrashLoop: 
prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:11:24] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1026.eqiad.wmnet [23:11:33] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1025.eqiad.wmnet [23:14:58] RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:32:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10533157 (10MatthewVernon) The host appears to be down, so I can't look (and I'm just home from the pub, so I'm not about to attempt anything more involved). If you power it up,... [23:34:47] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:34:57] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:57:49] PROBLEM - Host install7001 is DOWN: PING CRITICAL - Packet loss = 100% [23:57:55] RECOVERY - Host install7001 is UP: PING OK - Packet loss = 0%, RTA = 166.51 ms