[00:02:31] (03PS9) 10BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) [00:02:31] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087592|TextPassDumper: refresh content address on failure (T377594)]], [[gerrit:1087593|TextPassDumper: refresh content address on failure (T377594)]] (duration: 08m 48s) [00:02:34] T377594: Fix Dumps - errors exporting good revisions - https://phabricator.wikimedia.org/T377594 [00:07:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:07:46] (03PS2) 10Ryan Kemper: Resume XML/SQL dumps now that data qual fixed [puppet] - 10https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) [00:08:10] (03CR) 10Ssingh: "A bit late sorry but as communicated on IRC, looks good and thanks Amir for rolling it out. 
No ATS restart was required (and you didn't bu" [puppet] - 10https://gerrit.wikimedia.org/r/1087591 (https://phabricator.wikimedia.org/T374683) (owner: 10Arlolra) [00:08:30] (03CR) 10Btullis: [C:03+1] Resume XML/SQL dumps now that data qual fixed [puppet] - 10https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) (owner: 10Ryan Kemper) [00:08:45] (03CR) 10Ebernhardson: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) (owner: 10Ryan Kemper) [00:09:45] (03CR) 10Ryan Kemper: [C:03+2] Resume XML/SQL dumps now that data qual fixed [puppet] - 10https://gerrit.wikimedia.org/r/1087598 (https://phabricator.wikimedia.org/T377594) (owner: 10Ryan Kemper) [00:11:23] (03CR) 10Santiago Faci: replace list of cassandra hosts with faux values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087514 (owner: 10Eevans) [00:21:29] !log T377594 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087598; ran puppet on `snapshot101[0-7]*`. 
These dumps should be re-enabled now [00:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:38] T377594: Fix Dumps - errors exporting good revisions - https://phabricator.wikimedia.org/T377594 [00:38:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087600 [00:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087600 (owner: 10TrainBranchBot) [00:41:13] !incidents [00:41:13] 5367 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [00:41:15] !ack 5367 [00:41:16] 5367 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [00:42:17] sukhe: here now as well if you want more hands [00:42:24] which link is it? [00:42:24] swfrench-wmf: thanks <3 [00:42:30] is it equinix [00:43:06] I was about to assume this is US election-related hot linking :) [00:43:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:43:32] should be recovering. [00:43:33] NTT transit [00:43:33] ok [00:43:42] thats interesting [00:44:16] yeah, seems to be that [00:44:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1021 gradually with 4 steps - Maint over [00:44:52] I don’t think we have to do anything unless it happens again [00:45:08] I’m going afk [00:45:23] yeah I am not sure what we can do. 
go ahead, I will be here intermittently [00:55:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs1013 - https://phabricator.wikimedia.org/T379026#10294950 (10VRiley-WMF) [00:57:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs1013 - https://phabricator.wikimedia.org/T379026#10294952 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This has been decommissioned [01:08:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087602 [01:08:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087602 (owner: 10TrainBranchBot) [01:10:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1087600 (owner: 10TrainBranchBot) [01:21:31] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087605 (https://phabricator.wikimedia.org/T378260) [01:21:33] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087605 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [01:22:19] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087605 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [01:23:00] !log zabe@deploy2002 Started scap sync-world: T378260 [01:23:03] T378260: Retire labtestwiki - https://phabricator.wikimedia.org/T378260 [01:27:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:30:34] !log zabe@deploy2002 Finished scap sync-world: T378260 (duration: 07m 34s) [01:30:46] T378260: Retire labtestwiki - https://phabricator.wikimedia.org/T378260 [01:31:00] !incidents 
[01:31:00] 5368 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:31:01] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:31:08] !ack 5368 [01:31:09] 5368 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:32:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:32:32] Lumen [01:32:33] same situation as before (same port, etc.) [01:32:49] sukhe: Lumen? [01:32:55] swfrench-wmf: so NTT as well? [01:32:57] https://librenms.wikimedia.org/graphs/to=1730856600/id=11624/type=port_bits/from=1730770200/ [01:33:15] Lumen also looks unhappy [01:33:21] I have not found the source of this though [01:33:34] this one was AFAICT for xe-3/1/6 again (NTT) [01:33:43] ok [01:34:41] https://w.wiki/BrSp [01:34:44] now let's dig a bit deeper [01:35:06] (03PS1) 10Zabe: mediawiki-cache-warmup: Remove labtestwiki from dbname filter [puppet] - 10https://gerrit.wikimedia.org/r/1087606 (https://phabricator.wikimedia.org/T378260) [01:44:39] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1087602 (owner: 10TrainBranchBot) [01:53:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:53:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page 
[01:53:44] (03PS1) 10Zabe: snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) [01:54:22] (03CR) 10CI reject: [V:04-1] snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [01:54:36] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [01:55:18] (03PS2) 10Zabe: snapshot: Remove labtestwiki from excluded wikis [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) [01:56:58] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10295009 (10Platonides) @eoghan these messageids are of the form %Y%m%d%H%M%S@test.bug.T377045.wikimedia.es So, filtering at the entries that w... 
[01:58:47] !incidents [01:58:48] 5369 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:58:48] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:58:48] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [01:58:50] !ack 5369 [01:58:50] 5369 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:08:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:12:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [02:12:50] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [02:13:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port 
utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:22:50] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [02:27:05] should recover, smaller spike [02:27:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:33:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:57:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:58:54] !inc!incidents [02:58:58] !incidents [02:58:58] 5372 (UNACKED) Primary outbound port 
utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:58:59] 5371 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:58:59] 5370 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:58:59] 5369 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:58:59] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:58:59] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [02:59:02] !ack 5372 [02:59:02] 5372 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:09] (03PS1) 10CDanis: haproxy: bwlim-by-path: also roll out to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) [03:07:02] (03PS2) 10CDanis: haproxy: bwlim-by-path: also roll out to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) [03:07:06] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [03:07:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:08:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [03:13:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [03:28:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:33:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:58:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [04:03:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [04:09:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:13:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:16:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:20:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:21:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:23:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port 
utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [04:24:24] (03PS1) 10KartikMistry: Update cxserver to 2024-10-25-044319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087616 (https://phabricator.wikimedia.org/T377160) [04:25:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:28:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [04:31:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:36:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [04:46:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:29:44] Doing cxserver deployment. Minor changes. [05:30:13] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-10-25-044319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087616 (https://phabricator.wikimedia.org/T377160) (owner: 10KartikMistry) [05:31:12] (03Merged) 10jenkins-bot: Update cxserver to 2024-10-25-044319-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087616 (https://phabricator.wikimedia.org/T377160) (owner: 10KartikMistry) [05:32:31] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:33:49] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:34:12] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:36:52] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:37:19] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:37:31] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:38:06] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:38:41] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:41:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [05:43:25] FIRING: [4x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:46:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [05:52:31] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:52:51] !log Updated cxserver to 2024-10-25-044319-production (T377160, T375102, T371420) [05:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:56] T377160: Post-creation work for annwiki - https://phabricator.wikimedia.org/T377160 [05:52:57] T375102: Post-creation work for nrwiki - https://phabricator.wikimedia.org/T375102 [05:52:57] T371420: Complete enablement Section Translation in new wikis and make the process less manual for the future - https://phabricator.wikimedia.org/T371420 [05:57:31] RESOLVED: Traffic bill over quota: Alert for device cr2-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:04:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [06:19:20] RESOLVED: 
CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [06:26:41] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:32:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [06:36:47] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:41:11] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10295141 (10Papaul) [06:45:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:25] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10295142 (10Papaul) There will be some maintenance in magru sometime next week and the site will be de-pool we can take advantage of this maintenance window to upgrade the router th... 
[06:57:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T0700) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:59] (03CR) 10Arnaudb: 
[C:03+2] Revert "mariadb: wipe /srv on pc1017" [puppet] - 10https://gerrit.wikimedia.org/r/1087499 (owner: 10Arnaudb) [08:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:03:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087508 (owner: 10Muehlenhoff) [08:04:02] (03CR) 10Muehlenhoff: [C:03+2] Add a helper script to setup the Ganeti LVM vg [puppet] - 10https://gerrit.wikimedia.org/r/1087412 (owner: 10Muehlenhoff) [08:07:29] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087572 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:07:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:08:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:12:13] !log manually cleared /root/.ssh/known_hosts on the cumin hosts - T336485 [08:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:22] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485#10295209 (10Volans) >>! In T336485#10294334, @cmooney wrote: > I don't see that forced in /etc/ssh/ssh_config though. Also w... 
[08:12:26] T336485: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 [08:13:13] (03PS1) 10Volans: mysql_legacy: improve pymysql usability [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 [08:13:13] (03PS1) 10Volans: mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 [08:13:14] (03PS1) 10Volans: mysql_legacy: add MysqlClient class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 [08:13:14] (03PS1) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [08:13:14] (03PS1) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [08:14:20] (03CR) 10Elukey: [V:03+1 C:03+2] role::aux_k8s: clean up after containerd migration [puppet] - 10https://gerrit.wikimedia.org/r/1085391 (https://phabricator.wikimedia.org/T378345) (owner: 10Elukey) [08:22:00] (03CR) 10Elukey: [C:03+2] ms-be-simple: partman EFI recipe [puppet] - 10https://gerrit.wikimedia.org/r/1087538 (https://phabricator.wikimedia.org/T371400) (owner: 10JHathaway) [08:26:54] (03PS1) 10Elukey: profile::installserver::preseed: use the EFI recipe for ms-be2083 [puppet] - 10https://gerrit.wikimedia.org/r/1087858 (https://phabricator.wikimedia.org/T371400) [08:29:00] (03CR) 10MVernon: [C:03+1] profile::installserver::preseed: use the EFI recipe for ms-be2083 [puppet] - 10https://gerrit.wikimedia.org/r/1087858 (https://phabricator.wikimedia.org/T371400) (owner: 10Elukey) [08:29:13] (03CR) 10Volans: mysql_legacy: add MysqlClient class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [08:33:43] (03CR) 10Muehlenhoff: "Let's not do that, this will only lead to confusion. 
Going forward all of this information will move as metadata into Bitu, so the additio" [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [08:36:53] (03CR) 10Elukey: [C:03+1] mysql_legacy: improve pymysql usability (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 (owner: 10Volans) [08:37:38] (03CR) 10Elukey: [C:03+1] mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 (owner: 10Volans) [08:39:17] (03PS1) 10Muehlenhoff: Add the ganeti role to ganeti1043/ganeti1044 [puppet] - 10https://gerrit.wikimedia.org/r/1087859 (https://phabricator.wikimedia.org/T378921) [08:39:21] (03CR) 10Elukey: mysql_legacy: add MysqlClient class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [08:39:58] (03PS1) 10Volans: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 [08:39:58] (03PS1) 10Volans: Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 [08:40:24] (03CR) 10Elukey: [C:03+1] mysql_legacy: add MysqlClient class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [08:40:36] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:40:42] (03CR) 10Elukey: [C:03+1] mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 (owner: 10Volans) [08:41:32] (03CR) 10Volans: mysql_legacy: add MysqlClient class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [08:41:45] (03CR) 10Elukey: [C:03+1] mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [08:41:58] (03CR) 10Elukey: [C:03+2] 
profile::installserver::preseed: use the EFI recipe for ms-be2083 [puppet] - 10https://gerrit.wikimedia.org/r/1087858 (https://phabricator.wikimedia.org/T371400) (owner: 10Elukey) [08:42:25] (03CR) 10Volans: "CI failing is because of the depends-on that has not yet been merged/released into spicerack." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [08:43:00] (03CR) 10Volans: "CI failing is because of the depends-on that has not yet been merged/released into spicerack." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [08:43:15] (03PS2) 10Volans: mysql_legacy: improve pymysql usability [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 [08:43:16] (03PS2) 10Volans: mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 [08:43:16] (03PS2) 10Volans: mysql_legacy: add MysqlClient class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 [08:43:16] (03PS2) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [08:43:17] (03PS2) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [08:43:22] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [08:43:22] (03CR) 10CI reject: [V:04-1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [08:43:50] (03CR) 10Volans: mysql_legacy: improve pymysql usability (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 (owner: 10Volans) [08:44:46] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: improve pymysql usability [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087852 (owner: 10Volans) [08:45:36] FIRING: [7x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - 
https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [08:46:39] (03CR) 10Arnaudb: [C:03+1] mysql: remove deprecated call [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087853 (owner: 10Volans) [08:46:41] (03PS1) 10Fabfur: hiera: enable haproxykafka on whole ulsfo dc [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) [08:46:44] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:46:47] (03CR) 10Muehlenhoff: [C:03+2] Add the ganeti role to ganeti1043/ganeti1044 [puppet] - 10https://gerrit.wikimedia.org/r/1087859 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [08:49:51] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:51:15] (03PS1) 10Jaime Nuche: Fix category creations [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) [08:53:00] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: add MysqlClient class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087854 (owner: 10Volans) [08:54:08] (03CR) 10Arnaudb: [C:03+1] mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 (owner: 10Volans) [08:55:03] (03CR) 10Jaime Nuche: "This is currently blocking the train deployment: https://phabricator.wikimedia.org/T375661 A pair of eyes here would be appreciated" [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) (owner: 10Jaime Nuche) [08:56:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:57:02] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: rename to mysql [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1087856 (owner: 10Volans) [08:58:21] (03PS2) 10Abijeet Patro: Fix automatic category creations by FuzzyBot [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) (owner: 10Jaime Nuche) [09:00:05] jnuche and dduvall: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T0900). [09:00:36] FIRING: [7x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:01:36] hi there, train is currently blocked on T285463 [09:01:37] T285463: FuzzyBot should automatically create wanted categories that are marked for translation - https://phabricator.wikimedia.org/T285463 [09:05:36] RESOLVED: [3x] Traffic bill over quota: Alert for device cr2-codfw.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:09:07] (03PS1) 10Elukey: sre.hosts.reimage: fix _validate() when using UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1087865 (https://phabricator.wikimedia.org/T373519) [09:10:20] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [09:13:06] (03PS1) 10Arnaudb: mariadb: add es204[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) [09:14:38] (03CR) 10Elukey: "test-cookbooked with ms-be2083 and it worked :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087865 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:15:25] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087865 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:18:25] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: fix _validate() 
when using UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1087865 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:20:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295315 (10ABran-WMF) >>! In T378146#10293606, @Jhancock.wm wrote: > @ABran-WMF we've received these servers. Please update the site.pp file. trying to... [09:20:30] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [09:21:32] (03CR) 10Cathal Mooney: [C:03+2] Configure cumin ssh to use network settings for fasw switches [puppet] - 10https://gerrit.wikimedia.org/r/1087572 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:22:03] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10295326 (10Volans) In our netbox config we have for the logging formatters: ` 'django.server': { '()': 'django.utils.log.ServerFormatter', 'format': '[%(server_t... [09:24:21] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10295330 (10cmooney) >>! In T379072#10295326, @Volans wrote: > and I guess server_time is not defined outside of django web app when when we run the `manage.py syncdatasource` co... 
[09:25:14] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1043 [09:25:23] (03CR) 10Abijeet Patro: [C:03+1] Fix automatic category creations by FuzzyBot [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) (owner: 10Jaime Nuche) [09:27:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1043 [09:28:02] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1044 [09:29:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1044 [09:30:45] (03CR) 10Jaime Nuche: [C:03+2] Fix automatic category creations by FuzzyBot [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) (owner: 10Jaime Nuche) [09:31:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [09:34:35] (03PS1) 10Lucas Werkmeister (WMDE): tables-catalog: Add GlobalUsage (globalimagelinks) [puppet] - 10https://gerrit.wikimedia.org/r/1087867 (https://phabricator.wikimedia.org/T363581) [09:35:27] (03CR) 10Arnaudb: [C:03+1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [09:36:41] (03CR) 10Arnaudb: [C:03+1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [09:37:12] (03PS1) 10Elukey: sre.hosts.reimage: fix remote command to use to test if d-i started [cookbooks] - 10https://gerrit.wikimedia.org/r/1087869 (https://phabricator.wikimedia.org/T373519) [09:38:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [09:38:34] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye 
[09:41:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [09:43:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:59] (03Merged) 10jenkins-bot: Fix automatic category creations by FuzzyBot [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087863 (https://phabricator.wikimedia.org/T285463) (owner: 10Jaime Nuche) [09:49:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [09:49:17] (03PS2) 10Slyngshede: P:idp enable Redis TGT backend [puppet] - 10https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728) [09:49:45] (03PS3) 10Slyngshede: P:idp enable Redis TGT backend [puppet] - 10https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728) [09:50:54] (03CR) 10Elukey: "The current /proc/cmd value is:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087869 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:51:38] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1087863|Fix automatic category creations by FuzzyBot (T285463)]] [09:51:41] T285463: FuzzyBot should automatically create wanted categories that are marked for translation - https://phabricator.wikimedia.org/T285463 [09:52:07] (03CR) 10Marostegui: "remember these need to go to zarcillo with the final rack location too." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) (owner: 10Arnaudb) [09:52:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1043.eqiad.wmnet to cluster eqiad and group B [09:52:30] (03CR) 10Marostegui: [C:04-1] "missing partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) (owner: 10Arnaudb) [09:53:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:45] (03CR) 10Ayounsi: [C:03+1] sre.hosts.reimage: fix remote command to use to test if d-i started [cookbooks] - 10https://gerrit.wikimedia.org/r/1087869 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:53:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1043.eqiad.wmnet to cluster eqiad and group B [09:54:19] (03CR) 10Volans: [C:03+1] "checking preseed LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087869 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:54:36] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1087863|Fix automatic category creations by FuzzyBot (T285463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:54:51] !log jnuche@deploy2002 jnuche: Continuing with sync [09:54:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1044.eqiad.wmnet to cluster eqiad and group B [09:55:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1044.eqiad.wmnet to cluster eqiad and group B [09:55:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10295412 (10MoritzMuehlenhoff) >>! In T365650#10287518, @elukey wrote: > Fixed 1044. For some reason IPv6 support was disabled, so our settings like `IPv6Au... [09:57:15] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: fix remote command to use to test if d-i started [cookbooks] - 10https://gerrit.wikimedia.org/r/1087869 (https://phabricator.wikimedia.org/T373519) (owner: 10Elukey) [09:57:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295413 (10MoritzMuehlenhoff) [09:59:42] !log jnuche@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087863|Fix automatic category creations by FuzzyBot (T285463)]] (duration: 08m 03s) [09:59:45] T285463: FuzzyBot should automatically create wanted categories that are marked for translation - https://phabricator.wikimedia.org/T285463 [09:59:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [10:05:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295447 (10ops-monitoring-bot) VM ml-etcd1001.eqiad.wmnet switching disk type to drbd [10:12:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [10:12:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [10:13:35] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10295498 (10MatthewVernon) @Deepesha_WMDE 
do you have a Wikimedia developer account? If so, what is the username? If not, can you create one following [[ https://wikitech.wikimedia.org/wi... [10:15:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295503 (10ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs [10:15:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [10:15:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [10:16:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295504 (10ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs [10:16:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [10:16:31] (03PS1) 10Slyngshede: Block Search: Priorities form input over query params. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1087875 (https://phabricator.wikimedia.org/T378338) [10:17:01] jouncebot: now [10:17:01] For the next 0 hour(s) and 42 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T0900) [10:17:02] (03PS2) 10Fabfur: hiera: enable haproxykafka on whole ulsfo dc [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) [10:17:19] ah, nevermind, jnuche already deployed https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1087863 :) [10:18:16] Lucas_WMDE: yeah, I'm waiting for confirmation the fix worked, then I'll continue with the train [10:19:04] nice [10:26:05] (03PS2) 10Arnaudb: mariadb: add es204[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) [10:26:20] (03CR) 10Arnaudb: "good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) (owner: 10Arnaudb) [10:26:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10295522 (10elukey) @MatthewVernon I tried to provision/reimage ms-be2083 with UEFI but we have the same `/dev/disk/by-path` duplication issue, I think it... 
[10:27:48] (03PS3) 10Arnaudb: mariadb: add es204[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) [10:27:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to plain [10:28:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295524 (10ops-monitoring-bot) VM ml-etcd1001.eqiad.wmnet switching disk type to plain [10:28:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to plain [10:28:38] (03PS1) 10Lucas Werkmeister (WMDE): Document available wbformatvalue options [extensions/Wikibase] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087877 (https://phabricator.wikimedia.org/T323778) [10:29:26] (03CR) 10Lucas Werkmeister (WMDE): "I think I’d like to backport this so it’s live when we send the breaking change announcement out. 
(But backports with i18n changes have be" [extensions/Wikibase] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087877 (https://phabricator.wikimedia.org/T323778) (owner: 10Lucas Werkmeister (WMDE)) [10:29:59] fix seemed to have worked, train will be rolling ahead in the next few minutes [10:30:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Wikibase] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087877 (https://phabricator.wikimedia.org/T323778) (owner: 10Lucas Werkmeister (WMDE)) [10:32:35] !log push new pfw policies - T379127 [10:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:33:16] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087878 (https://phabricator.wikimedia.org/T375661) [10:33:18] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087878 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [10:34:03] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087878 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [10:35:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085572 (owner: 10Hamish) [10:36:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10295541 (10MatthewVernon) Alas :( I think adjusting the fact is the way 
to go? Presumably it now needs to keep track of the targets of the symlinks in... [10:39:29] (03CR) 10Elukey: [V:03+1] "Adding also Brian and David since the tlproxy::localssl is deployed on elastic nodes, even if it will be a no-op." [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [10:39:29] (03CR) 10Vgutierrez: [C:03+1] hiera: enable haproxykafka on whole ulsfo dc [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:41:15] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.2 refs T375661 [10:41:18] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [10:42:44] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on whole ulsfo dc [puppet] - 10https://gerrit.wikimedia.org/r/1087862 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:43:06] !log depool maps1005 to test an nginx config - T378944 [10:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s - https://phabricator.wikimedia.org/T378944 [10:43:33] !log rolling out haproxykafka on all ULSFO cp hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087862) (T378578) [10:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:43] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [10:45:20] (03CR) 10Marostegui: [C:03+1] mariadb: add es204[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) (owner: 10Arnaudb) [10:45:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10295570 (10ABran-WMF) so, 1020 would go in [[ https://netbox.wikimedia.org/dcim/racks/2/ 
| A2 ]] es1045 in [[ https://netbox.wikimedia.org/dcim/racks/8... [10:45:44] (03CR) 10Arnaudb: [C:03+2] mariadb: add es204[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1087866 (https://phabricator.wikimedia.org/T378146) (owner: 10Arnaudb) [10:47:03] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295572 (10ABran-WMF) >>! In T378146#10295315, @ABran-WMF wrote: > done, but given that's my first install task, I'll wait for CR approval to come! done! [10:50:50] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [10:51:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [10:55:56] (03CR) 10Slyngshede: [C:03+2] P:idp enable Redis TGT backend [puppet] - 10https://gerrit.wikimedia.org/r/1087167 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [10:56:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295607 (10Marostegui) >>! In T378146#10293606, @Jhancock.wm wrote: > @Marostegui this should make it diverse. lmk if you want something different. > e... [10:56:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10295608 (10elukey) >>! In T371400#10295541, @MatthewVernon wrote: > Alas :( > > I think adjusting the fact is the way to go? Presumably it now needs to... 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1100) [11:05:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:05:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Confirming that lines were only moved around and not changed, as far as I can tell." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085572 (owner: 10Hamish) [11:06:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10295649 (10MoritzMuehlenhoff) I had been running into issues with moving VMs to ganeti1041 this morning (which is already added to the Ganeti cluster) and... 
[11:08:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [11:08:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10295654 (10ops-monitoring-bot) Draining ganeti1041.eqiad.wmnet of running VMs [11:17:12] (03PS1) 10Vgutierrez: hiera,liberica: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1087887 (https://phabricator.wikimedia.org/T377127) [11:18:05] (03PS2) 10Vgutierrez: hiera,liberica: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1087887 (https://phabricator.wikimedia.org/T377127) [11:18:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087887 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:56] (03CR) 10Vgutierrez: [C:03+2] hiera,liberica: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1087887 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [11:25:48] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for memcached/redis roles [puppet] - 10https://gerrit.wikimedia.org/r/1083160 (owner: 10Muehlenhoff) [11:30:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [11:30:06] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:30:20] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:31:53] !log 
elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:32:42] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:34:03] PROBLEM - Host ganeti1041 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:05] PROBLEM - Host ganeti1044 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:12] these two are me --^ [11:36:33] RECOVERY - Host ganeti1044 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [11:37:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:37:17] RECOVERY - Host ganeti1041 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [11:37:37] FIRING: ProbeDown: Service ganeti1041:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1041.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:38:20] (03PS1) 10Muehlenhoff: sre.hosts.provision: Turn virt warning into a hard error [cookbooks] - 10https://gerrit.wikimedia.org/r/1087889 [11:38:58] RESOLVED: ProbeDown: Service ganeti1041:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:41:41] (03PS3) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [11:42:10] (03CR) 10Elukey: [C:03+1] "LGTM, we could also think about silently enabling the flag instead of hard failing, but it is also good to be explicit. Fine to proceed fo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087889 (owner: 10Muehlenhoff) [11:42:51] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087889 (owner: 10Muehlenhoff) [11:45:43] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4456/console" [puppet] - 10https://gerrit.wikimedia.org/r/1087371 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [11:48:17] (03PS1) 10MVernon: facts: adjust swift_disks fact to handle new SM kit [puppet] - 10https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) [11:48:47] (03PS1) 10Marostegui: control-mariadb-10.6-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1087892 (https://phabricator.wikimedia.org/T378940) [11:49:27] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [11:49:28] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.6-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1087892 (https://phabricator.wikimedia.org/T378940) (owner: 10Marostegui) [11:49:57] (03Merged) 10jenkins-bot: control-mariadb-10.6-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1087892 (https://phabricator.wikimedia.org/T378940) (owner: 10Marostegui) 
[11:50:58] (03Abandoned) 10Fabfur: haproxykafka: restart service on config file changes [puppet] - 10https://gerrit.wikimedia.org/r/1087371 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [11:52:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:53:18] (03PS4) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) [11:53:21] (03PS3) 10Abijeet Patro: tables-catalog: Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) [11:53:26] (03CR) 10Abijeet Patro: tables-catalog: Add translate_cache table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [11:53:28] (03CR) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [11:53:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:51] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:54:09] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:55:09] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:55:29] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:55:51] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:57:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[11:59:15] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1087435 (owner: PipelineBot)
[12:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1200). Please do the needful.
[12:00:16] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087435 (owner: 10PipelineBot) [12:02:17] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:02:47] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:03:25] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:03:53] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:05:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P70957 and previous config saved to /var/cache/conftool/dbconfig/20241106-120536-arnaudb.json [12:06:22] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:06:57] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:08:10] (03PS1) 10Arnaudb: mariadb: disable notifications on pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1087893 [12:09:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db1206 quickly with 2 steps - repool [12:09:39] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1206 quickly with 2 steps - repool [12:09:47] (03CR) 10Arnaudb: [C:03+2] mariadb: disable notifications on pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1087893 (owner: 10Arnaudb) [12:14:08] (03CR) 10Muehlenhoff: [C:03+2] sre.hosts.provision: Turn virt warning into a hard error [cookbooks] - 10https://gerrit.wikimedia.org/r/1087889 (owner: 10Muehlenhoff) [12:21:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: testing [12:21:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1125.eqiad.wmnet with reason: testing [12:21:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2230.codfw.wmnet with 
reason: testing [12:21:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2230.codfw.wmnet with reason: testing [12:22:11] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:22:25] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:22:40] (03PS1) 10Volans: sre.mysql.pool: fix check for diff [cookbooks] - 10https://gerrit.wikimedia.org/r/1087895 [12:23:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295844 (10ABran-WMF) >>! In T378146#10295607, @Marostegui wrote: >>>! In T378146#10293606, @Jhancock.wm wrote: >> es2043 > [[ https://netbox.wikimedia... 
[12:23:19] !log arnaudb@cumin1002 dbctl commit (dc=all): '"db1206 pending"', diff saved to https://phabricator.wikimedia.org/P70959 and previous config saved to /var/cache/conftool/dbconfig/20241106-122318-arnaudb.json [12:23:37] !log Migrate db1125 to MariaDB 10.6.20 T378940 [12:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:39] T378940: Compile and package MariaDB 10.11.10 and 10.6.20 - https://phabricator.wikimedia.org/T378940 [12:23:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 depool to test cookbook hotfix on CR 1087895', diff saved to https://phabricator.wikimedia.org/P70960 and previous config saved to /var/cache/conftool/dbconfig/20241106-122348-arnaudb.json [12:25:00] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db1206 quickly with 2 steps - test 1087895 [12:25:42] (03CR) 10Arnaudb: "tested on db1206 →" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087895 (owner: 10Volans) [12:25:49] (03CR) 10Arnaudb: [C:03+1] sre.mysql.pool: fix check for diff [cookbooks] - 10https://gerrit.wikimedia.org/r/1087895 (owner: 10Volans) [12:27:11] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:27:25] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:27:35] sorry for the noise [12:28:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295858 (10Marostegui) I would suggest to update the task so this new racking proposal is clearer. 
[12:30:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10295863 (10ABran-WMF) [12:31:21] (03CR) 10Volans: [C:03+2] sre.mysql.pool: fix check for diff [cookbooks] - 10https://gerrit.wikimedia.org/r/1087895 (owner: 10Volans) [12:34:23] (03PS1) 10Slyngshede: Upgrade CAS and enable Redis TGT [dns] - 10https://gerrit.wikimedia.org/r/1087896 (https://phabricator.wikimedia.org/T377728) [12:37:05] (03Merged) 10jenkins-bot: sre.mysql.pool: fix check for diff [cookbooks] - 10https://gerrit.wikimedia.org/r/1087895 (owner: 10Volans) [12:40:10] (03CR) 10Nikerabbit: [C:03+1] tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [12:40:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1206 quickly with 2 steps - test 1087895 [12:40:19] (03CR) 10Nikerabbit: [C:03+1] tables-catalog: Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [12:40:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, idp2004 appears to work fine" [dns] - 10https://gerrit.wikimedia.org/r/1087896 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [12:41:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [12:41:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295895 (10ops-monitoring-bot) VM ml-etcd1001.eqiad.wmnet switching disk type to drbd [12:43:27] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:45:56] (03CR) 10Arnaudb: 
[C:03+2] mariadb: productionize db2236 [puppet] - 10https://gerrit.wikimedia.org/r/1087202 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [12:50:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to drbd [12:52:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet [12:52:13] (03CR) 10Slyngshede: [C:03+2] Upgrade CAS and enable Redis TGT [dns] - 10https://gerrit.wikimedia.org/r/1087896 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [12:52:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295908 (10ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs [12:52:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [12:52:53] !log IDP/CAS-SSO Enable Redis TGT backend [12:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: provisionning db2236.codfw.wmnet - T373579 [12:54:43] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [12:54:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: provisionning db2236.codfw.wmnet - T373579 [12:54:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: provisionning db2236.codfw.wmnet - T373579 [12:55:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: provisionning db2236.codfw.wmnet - T373579 [12:55:15] !log arnaudb@cumin1002 START - Cookbook sre.mysql.depool db2136 - 
depooling db2136 to clone on db2236 [12:55:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2136 - depooling db2136 to clone on db2236 [12:55:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1001.eqiad.wmnet to plain [12:56:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295921 (10ops-monitoring-bot) VM ml-etcd1001.eqiad.wmnet switching disk type to plain [12:56:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2136 in db2236 for T373579', diff saved to https://phabricator.wikimedia.org/P70964 and previous config saved to /var/cache/conftool/dbconfig/20241106-125648-arnaudb.json [12:56:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1001.eqiad.wmnet to plain [12:58:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd [13:00:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10295930 (10ops-monitoring-bot) VM dse-k8s-etcd1002.eqiad.wmnet switching disk type to drbd [13:02:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2136.codfw.wmnet onto db2236.codfw.wmnet [13:04:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:30] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd [13:16:35] (03PS1) 10Arnaudb: mariadb: productionize db2226 [puppet] - 10https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) [13:19:11] (03PS1) 10Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) [13:20:01] (03PS2) 10Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) [13:20:55] (03CR) 10CI reject: [V:04-1] airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [13:24:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:17] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:27:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [13:27:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1041.eqiad.wmnet 
[13:27:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet
[13:28:23] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10296011 (ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs
[13:28:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet
[13:30:02] (CR) Hnowlan: [C:+1] replace list of cassandra hosts with faux values (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087514 (owner: Eevans)
[13:33:38] (PS4) Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192)
[13:34:14] (CR) CI reject: [V:-1] openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: Arturo Borrero Gonzalez)
[13:38:07] !incidents
[13:38:08] 5375 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:08] 5374 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:08] 5373 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:08] 5372 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:09] 5371 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:09] 5370 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:09] 5369 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:10] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:38:10] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[13:39:22] (PS3) Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928)
[13:41:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain
[13:42:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:43:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:43:53] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10296070 (ops-monitoring-bot) VM dse-k8s-etcd1002.eqiad.wmnet switching disk type to plain
[13:44:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain
[13:47:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1044.eqiad.wmnet to cluster eqiad and group B
[13:50:30] (PS4) Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928)
[13:52:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1044.eqiad.wmnet to cluster eqiad and group B
[13:52:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet
[13:52:42] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10296112 (MoritzMuehlenhoff)
[13:52:50] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10296116 (ops-monitoring-bot) Draining ganeti1014.eqiad.wmnet of running VMs
[13:53:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:20] (PS1) Muehlenhoff: Add ganeti1045/ganeti1046 as Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1087912 (https://phabricator.wikimedia.org/T378921)
[13:57:10] (CR) Muehlenhoff: [C:+2] Add ganeti1045/ganeti1046 as Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1087912 (https://phabricator.wikimedia.org/T378921) (owner: Muehlenhoff)
[13:57:29] (CR) Marostegui: [C:-1] "Needs to be included in s2" [puppet] - https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[13:58:25] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mcrouter-exporter.service on idp2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1400).
[14:00:05] Hamishcz and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:45] (03PS5) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [14:01:06] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: update to 2024-11-05-170900-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087557 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [14:01:40] (03PS1) 10Slyngshede: P:idp enable Redis TGT for all of production. [puppet] - 10https://gerrit.wikimedia.org/r/1087913 (https://phabricator.wikimedia.org/T377728) [14:02:27] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10296152 (10ABran-WMF) it seems that the issue occured again [[ https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-data-persistence/202411... [14:02:40] !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad for service: ncredir-addrs [reason: no reason specified, T378453] [14:02:44] T378453: Testing liberica with ncredir@eqiad - https://phabricator.wikimedia.org/T378453 [14:02:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad for service: ncredir-addrs [reason: no reason specified, T378453] [14:03:07] o/ [14:03:09] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4457/console" [puppet] - 10https://gerrit.wikimedia.org/r/1087913 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:03:11] I can deploy! 
[14:04:16] waiting for Hamishcz to show up, I think [14:04:26] my backport will take a long time so I’d prefer not to do that first [14:04:30] (though I guess i can still +2 it) [14:04:36] (03PS2) 10Dzahn: admin: add group approvers for druid-admins, htmldumps-admin, udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) [14:04:49] (03CR) 10Dzahn: admin: add group approvers for druid-admins, htmldumps-admin, udp2log-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [14:04:59] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1087877 (https://phabricator.wikimedia.org/T323778) (owner: 10Lucas Werkmeister (WMDE)) [14:05:03] (03CR) 10Dzahn: "Ok, gotcha!" [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [14:06:56] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4458/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087913 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:09:40] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1046:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10296184 (10Jhancock.wm) it is! must have overlooked it. thanks! 
[14:10:18] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T379116#10296181 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:10:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10296186 (10Jhancock.wm) [14:10:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10296188 (10Jhancock.wm) a:03Jhancock.wm [14:10:55] RESOLVED: [3x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1046:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:04] Sorry i was stuck by traffic jam, is the windows still available pls? [14:13:36] for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1085572 [14:13:56] yes! hi! [14:14:18] Lucas_WMDE: hi! 
sorry for waiting, again :) [14:14:35] I found your +1 and was going to ping u lol [14:14:38] (03PS1) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) [14:14:51] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1045 [14:15:23] (03PS1) 10Ayounsi: Policy BGP_Infra_In: add term prioritize_experimental [homer/public] - 10https://gerrit.wikimedia.org/r/1087918 (https://phabricator.wikimedia.org/T378453) [14:15:23] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:15:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085572 (owner: 10Hamish) [14:15:54] (03PS2) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) [14:16:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1045 [14:16:15] (03Merged) 10jenkins-bot: Cleanup for logo related file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1085572 (owner: 10Hamish) [14:16:43] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1085572|Cleanup for logo related file]] [14:16:58] Lucas_WMDE: Brilliant, and thank you! 
[14:16:59] (03CR) 10Cathal Mooney: [C:03+1] Policy BGP_Infra_In: add term prioritize_experimental [homer/public] - 10https://gerrit.wikimedia.org/r/1087918 (https://phabricator.wikimedia.org/T378453) (owner: 10Ayounsi) [14:17:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:18:23] (03CR) 10Gmodena: [C:03+1] hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:18:30] (03CR) 10Ayounsi: [C:03+2] Policy BGP_Infra_In: add term prioritize_experimental [homer/public] - 10https://gerrit.wikimedia.org/r/1087918 (https://phabricator.wikimedia.org/T378453) (owner: 10Ayounsi) [14:19:03] (03Merged) 10jenkins-bot: Policy BGP_Infra_In: add term prioritize_experimental [homer/public] - 10https://gerrit.wikimedia.org/r/1087918 (https://phabricator.wikimedia.org/T378453) (owner: 10Ayounsi) [14:19:15] PROBLEM - Host cp2031 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:18] huh [14:19:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [14:19:34] !log lucaswerkmeister-wmde@deploy2002 hamishz, lucaswerkmeister-wmde: Backport for [[gerrit:1085572|Cleanup for logo related file]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:19:41] !log depool cp2031 [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:49] Hamishcz: can you check that the logos still look like they’re supposed to? 
[14:19:55] sukhe: I can take a look at mgmt if you are already busy
[14:19:59] (IIUC there should be no change from this deployment, it’s just cleaning up the order)
[14:20:05] mutante: thanks but can look as well
[14:20:08] ok
[14:20:34] (CR) Vgutierrez: "I'd go a step further and set the topics on hieradata/role/common/ rather than per DC" [puppet] - https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: Fabfur)
[14:20:56] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet
[14:20:59] RECOVERY - Host cp2031 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[14:21:48] mutante: thanks for the offer <3
[14:22:07] sukhe: yw, looks like it rebooted itself?
[14:22:29] Description: The power input for power supply 2 is lost.
[14:22:57] maybe check if dcops-codfw is on site
[14:23:21] btw there was already a broken PSU on this very host in the past https://phabricator.wikimedia.org/T335110
[14:23:28] (Merged) jenkins-bot: Document available wbformatvalue options [extensions/Wikibase] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1087877 (https://phabricator.wikimedia.org/T323778) (owner: Lucas Werkmeister (WMDE))
[14:23:41] ^ I’ll deploy that backport soon, hopefully
[14:23:50] Lucas_WMDE: doing, need to check w/ files so may need some time
[14:24:08] sukhe: you would think losing one of 2 PSUs doesnt cause this :/
[14:24:58] mutante: yeah, Jenn is on site so she is looking
[14:24:58] maybe it was the power cable after all. unfortunately the ticket doesnt say how it was fixed
[14:25:02] cool
[14:26:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet
[14:26:55] Lucas_WMDE: Confirm the logos are good
[14:27:00] !log lucaswerkmeister-wmde@deploy2002 hamishz, lucaswerkmeister-wmde: Continuing with sync
[14:27:09] ok, thanks!
[14:27:24] my pleasure:) [14:28:01] (03CR) 10Elukey: "The approach looks good, and it seems working but to be honest it seems to make the whole readability and maintainability to decrease (eve" [puppet] - 10https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [14:28:12] (03PS1) 10FNegri: Add komla to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) [14:30:56] (03PS3) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) [14:31:23] !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad for service: ncredir-addrs [reason: no reason specified, T378453] [14:31:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad for service: ncredir-addrs [reason: no reason specified, T378453] [14:31:30] T378453: Testing liberica with ncredir@eqiad - https://phabricator.wikimedia.org/T378453 [14:31:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085572|Cleanup for logo related file]] (duration: 15m 01s) [14:31:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:31:59] (03CR) 10Elukey: "The regex could be simplified even more to (ata-\d.\d|scsi-\d:\d:\d{2}:\d)-part4" [puppet] - 10https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [14:33:14] I’m starting my backport now, since it was already merged [14:33:22] it will probably take more than 30 minutes, because it touches i18n :( [14:33:28] sorry in advance [14:33:58] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1087877|Document available wbformatvalue options (T323778)]] [14:34:01] T323778: [ACTION-API] [TECH] 
Wikibase doesn’t validate formatter options, can crash with different TypeErrors - https://phabricator.wikimedia.org/T323778 [14:34:19] (CR) Hnowlan: [C:+1] changeprop-jobqueue: set max poll interval and revert concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1087558 (https://phabricator.wikimedia.org/T356241) (owner: Scott French) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:49] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1046 [14:40:38] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1087913 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede) [14:42:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1046 [14:43:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet [14:45:55] FIRING: MaxConntrack: Max conntrack at 95.68% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:46:30] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10296409 (MoritzMuehlenhoff) [14:47:12] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10296412 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done!
[14:47:33] (CR) Brouberol: [C:+1] spark: Avoid Ferm-specific syntax (take 2) [puppet] - https://gerrit.wikimedia.org/r/1087488 (owner: Muehlenhoff) [14:48:32] scap is currently at sync-testservers-k8s [14:48:39] !log installing usb.ids updates from Bookworm point release [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] taking longer than usual, probably because the image diff is bigger due to the l10n rebuild :/ [14:50:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet [14:50:55] RESOLVED: MaxConntrack: Max conntrack at 92.08% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:51:14] SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10296453 (MoritzMuehlenhoff) [14:51:27] !log installing php7.4 security updates [14:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:19] now at scap-cdb-rebuild (still in the test servers) [14:59:40] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1087877|Document available wbformatvalue options (T323778)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:43] yay [14:59:43] T323778: [ACTION-API] [TECH] Wikibase doesn’t validate formatter options, can crash with different TypeErrors - https://phabricator.wikimedia.org/T323778 [14:59:44] testing… [15:00:05] works \o/ [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1500) [15:00:06] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:02:28] (CR) Eevans: [C:+2] replace list of cassandra hosts with faux values
(1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087514 (owner: Eevans) [15:03:16] (CR) Eevans: [C:+2] replace list of cassandra hosts with faux values (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1087514 (owner: Eevans) [15:04:34] still deploying, sorry :( [15:04:42] (Merged) jenkins-bot: replace list of cassandra hosts with faux values [deployment-charts] - https://gerrit.wikimedia.org/r/1087514 (owner: Eevans) [15:05:41] (PS1) Dzahn: admin: add group approver for udp2log-users [puppet] - https://gerrit.wikimedia.org/r/1087926 (https://phabricator.wikimedia.org/T276465) [15:06:03] (PS2) MVernon: facts: adjust swift_disks fact to handle new SM kit [puppet] - https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) [15:06:14] (PS1) Esanders: Deploy EditCheck (references) to hiwiki, bnwiki, idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087927 (https://phabricator.wikimedia.org/T366381) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2136.codfw.wmnet onto db2236.codfw.wmnet [15:08:35] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) (owner: MVernon) [15:09:04] (CR) Elukey: [C:+1] facts: adjust swift_disks fact to handle new SM kit [puppet] - https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400) (owner: MVernon) [15:09:53] (CR) MVernon: [C:+2] facts: adjust swift_disks fact to handle new SM kit [puppet] - https://gerrit.wikimedia.org/r/1087891 (https://phabricator.wikimedia.org/T371400)
(owner: MVernon) [15:11:31] (PS2) Abijeet Patro: Translate: Enable message bundle Scribunto module [mediawiki-config] - https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) [15:12:43] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087877|Document available wbformatvalue options (T323778)]] (duration: 38m 45s) [15:12:46] T323778: [ACTION-API] [TECH] Wikibase doesn’t validate formatter options, can crash with different TypeErrors - https://phabricator.wikimedia.org/T323778 [15:13:08] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [15:14:02] (PS2) Arnaudb: mariadb: productionize db2226 [puppet] - https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) [15:14:22] (CR) Arnaudb: "good catch!" [puppet] - https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb) [15:18:42] !log gitlab1004 - systemctl start wmf_auto_restart_ssh-gitlab (because it had failed with "Service ssh-gitlab not present or not running") but now it's just fine and exits with "No restart necessary" T379166 [15:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:44] T379166: SystemdUnitFailed - wmf_auto_restart_ssh-gitlab.service on gitlab1004:9100 - https://phabricator.wikimedia.org/T379166 [15:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:29] (CR) Nikerabbit: [C:+1] Translate: Enable message bundle Scribunto module [mediawiki-config] - https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) (owner: Abijeet Patro) [15:21:26] * Lucas_WMDE done
deploying [15:23:15] (PS1) Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [15:24:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db2136 gradually with 4 steps - cloned on db2236 [15:26:02] (CR) CI reject: [V:-1] P:netbox remove CAS authentication leftovers. [puppet] - https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: Slyngshede) [15:27:32] (PS1) Clément Goubert: Revert "profile::docker::report: use the internal registry endpoint" [puppet] - https://gerrit.wikimedia.org/r/1087932 [15:28:45] (CR) Alexandros Kosiaris: [C:+1] Revert "profile::docker::report: use the internal registry endpoint" [puppet] - https://gerrit.wikimedia.org/r/1087932 (owner: Clément Goubert) [15:29:46] (CR) Clément Goubert: [C:+2] Revert "profile::docker::report: use the internal registry endpoint" [puppet] - https://gerrit.wikimedia.org/r/1087932 (owner: Clément Goubert) [15:31:48] !log installing Linux 5.10.226 on bullseye hosts [15:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:08] (PS1) Andrea Denisse: titan: bring thanos 5m retention to 44w [puppet] - https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) [15:33:34] (PS2) Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [15:33:38] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2024.10.19 - 2024.11.08): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10296585 (bking) DC Ops, Per IRC conversation in dc-ops channel , Cathal checked the network plumbing and everything looks good. `bmc-info` fro...
[15:33:49] (CR) Dzahn: [C:+1] "confirmed by Leo on IRC" [puppet] - https://gerrit.wikimedia.org/r/1087926 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn) [15:33:58] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1087926 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn) [15:34:28] (CR) Dzahn: [C:+2] admin: add group approver for udp2log-users [puppet] - https://gerrit.wikimedia.org/r/1087926 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn) [15:35:55] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:14] (CR) CI reject: [V:-1] P:netbox remove CAS authentication leftovers. [puppet] - https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: Slyngshede) [15:36:19] (CR) LMata: [V:+1] admin: add group approver for udp2log-users [puppet] - https://gerrit.wikimedia.org/r/1087926 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn) [15:37:13] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10296591 (Deepesha_WMDE) @MatthewVernon here is my username: Deepesha_WMDE. Please let me know if you need anything else from me. Thanks!
[15:38:10] (03PS2) 10Ssingh: hiera: set profile::lvs::do_ipv6_ra_primary on lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/1082257 (https://phabricator.wikimedia.org/T358260) [15:39:54] 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10296594 (10bking) [15:41:25] (03CR) 10Ssingh: [C:03+2] hiera: set profile::lvs::do_ipv6_ra_primary on lvs4010 [puppet] - 10https://gerrit.wikimedia.org/r/1082257 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [15:41:31] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#10296614 (10Dzahn) [15:42:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:43:08] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:48:19] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:48:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:49:35] (03PS4) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) [15:50:35] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:50:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087917 
(https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [15:51:05] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:53:53] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [15:55:03] !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [15:55:04] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on cr[3-4]-ulsfo with reason: prevent bgp alerts firing while lvs4010 is rebooted [15:55:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on cr[3-4]-ulsfo with reason: prevent bgp alerts firing while lvs4010 is rebooted [15:55:37] !log rebooting lvs4010 to verify new IPv6 sysctl's for RA processing work T358260 [15:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:40] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260 [15:57:21] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@294093b]: remove section alignment image suggestions, now in section topics v1.0.0 [15:57:28] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransc1001 - vriley@cumin1002" [15:57:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransc1001 - vriley@cumin1002" [15:57:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:58] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@294093b]: remove section alignment image suggestions, now in section topics v1.0.0 (duration: 01m 23s) [15:58:13] (03CR) 10Jcrespo: [C:04-1] "This needs test-s4 extensive 
testing, blocking merge until done." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [15:58:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:01:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [16:01:43] (03PS1) 10MVernon: swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) [16:02:02] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [16:08:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:08:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:10:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:10:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2136 gradually with 4 steps - cloned on db2236 [16:10:45] (03PS1) 10Vgutierrez: liberica: Harden healthcheck systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087939 [16:11:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:13:35] (03Abandoned) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087917 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:14:10] (03CR) 10Cwhite: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:14:31] (03PS2) 10MVernon: swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) [16:14:41] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [16:15:37] (03PS5) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) [16:16:05] (03PS1) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) [16:16:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:16:43] (03PS3) 10MVernon: swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) [16:16:50] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th 
percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:16:51] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [16:16:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:17:11] (03CR) 10CI reject: [V:04-1] WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [16:17:14] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [16:18:11] (03PS2) 10Vgutierrez: liberica: Harden healthcheck systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087939 [16:18:11] (03PS1) 10Vgutierrez: liberica: Harden cp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087941 [16:18:31] (03CR) 10Ottomata: hiera: split haproxykafka topics based on role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:18:44] (03PS3) 10Gmodena: refinery: gobblin: add webrequest_frontend. 
[puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) [16:18:59] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 176741 MB (4% inode=92%): /srv/swift-storage/sdc1 186571 MB (4% inode=91%): /srv/swift-storage/sdf1 229716 MB (6% inode=91%): /srv/swift-storage/sdg1 206647 MB (5% inode=91%): /srv/swift-storage/sdd1 193936 MB (5% inode=91%): /srv/swift-storage/sde1 199289 MB (5% inode=92%): /srv/swift-storage/sdi1 184636 MB (4% inode=91%): /srv/swift-st [16:18:59] k1 181125 MB (4% inode=92%): /srv/swift-storage/sdj1 193339 MB (5% inode=91%): /srv/swift-storage/sdl1 182177 MB (4% inode=91%): /srv/swift-storage/sdm1 189422 MB (4% inode=91%): /srv/swift-storage/sdn1 152482 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [16:19:00] (03PS2) 10Vgutierrez: liberica: Harden cp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087941 [16:19:01] (03PS4) 10MVernon: swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) [16:19:08] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [16:20:29] (03PS2) 10Fabfur: hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) [16:20:36] (03CR) 10Fabfur: hiera: split haproxykafka topics based on role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:20:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for fransc1001 - 
jclark@cumin1002" [16:20:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for fransc1001 - jclark@cumin1002" [16:20:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:57] (03CR) 10Ssingh: [C:03+1] hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:21:23] (03CR) 10Fabfur: [C:03+2] hiera: split haproxykafka topics based on role [puppet] - 10https://gerrit.wikimedia.org/r/1087940 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [16:22:13] (03CR) 10MVernon: "I'm not sure if the commit message or code change are what you were aiming for, but I don't think they match..." [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:23:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:24:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:24:55] (03CR) 10Elukey: [C:03+1] swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [16:25:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10296744 (10MoritzMuehlenhoff) [16:25:15] (03CR) 10MVernon: [C:03+2] swift: use regex to handle Dell & SM accounts|containers disks [puppet] - 10https://gerrit.wikimedia.org/r/1087935 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) 
[16:25:38] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [16:26:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:26:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:28:43] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:11] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [16:31:28] (03PS1) 10Fabfur: hiera: moving haproxykafka common keys to profile [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) [16:31:43] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet [16:32:34] !log remove ganeti1014 from active ganeti nodes T378921 [16:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:44] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [16:32:54] (03CR) 10Cwhite: titan: bring thanos 5m retention to 44w (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:33:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - 
https://phabricator.wikimedia.org/T378921#10296782 (10MoritzMuehlenhoff) [16:34:24] (03PS1) 10Vgutierrez: liberica: Harden fp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087944 [16:35:05] PROBLEM - ganeti-confd running on ganeti1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:35:19] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:35:23] PROBLEM - ganeti-noded running on ganeti1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:36:14] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:37:19] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:37:37] FIRING: ProbeDown: Service ganeti1014:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173 (10Kgraessle) 03NEW [16:38:37] (03CR) 10Aqu: [C:03+1] refinery: gobblin: add webrequest_frontend. [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) (owner: 10Gmodena) [16:39:38] (03CR) 10Xcollazo: [C:03+1] "LGTM. Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [16:40:07] (03CR) 10Xcollazo: [C:03+1] "CCing @btullis@wikimedia.org FYI." [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [16:43:31] (03PS2) 10Andrea Denisse: titan: bring thanos 5m retention to 10w [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) [16:44:19] (03CR) 10Andrea Denisse: titan: bring thanos 5m retention to 10w (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:44:21] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296813 (10Papaul) [16:45:42] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296815 (10Papaul) [16:45:42] (03CR) 10Cwhite: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:45:58] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296817 (10Papaul) [16:52:26] (03CR) 10MVernon: [C:03+1] "LGTM thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:52:47] (03CR) 10Andrea Denisse: [C:03+2] titan: bring thanos 5m retention to 10w [puppet] - 10https://gerrit.wikimedia.org/r/1087933 (https://phabricator.wikimedia.org/T351927) (owner: 10Andrea Denisse) [16:55:26] (03PS4) 10Dzahn: admin: add group approvers for druid-admins and htmldumps-admin [puppet] - 10https://gerrit.wikimedia.org/r/1087575 (https://phabricator.wikimedia.org/T276465) [16:58:53] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [17:04:00] (03CR) 10Ottomata: [C:03+1] "One nit. +1 otherwise. let me know if I can help merge when you are ready." [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) (owner: 10Gmodena) [17:05:34] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:06:52] (03PS1) 10MVernon: preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes [puppet] - 10https://gerrit.wikimedia.org/r/1087949 (https://phabricator.wikimedia.org/T371400) [17:07:26] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296916 (10ssingh) `cr1-eqiad` is stated for Nov 13 but note that T376737 is also scheduled for that period (Nov 13, 8 CT) and it might make tricky for both `magru` and `eqiad` to... 
[17:09:11] PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100% [17:09:41] RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [17:09:51] (03PS6) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) [17:11:07] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002" [17:11:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt fransw1001 - vriley@cumin1002" [17:11:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:11:25] (03CR) 10FNegri: WMCS: split cloudvirt alerts from generic nodes (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [17:11:34] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [17:12:26] (03PS1) 10Dzahn: backup: add /var/lib/gerrit to gerrit repodata backup filesets [puppet] - 10https://gerrit.wikimedia.org/r/1087950 (https://phabricator.wikimedia.org/T338470) [17:12:44] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296934 (10Papaul) [17:13:21] (03CR) 10Dzahn: "cc: Hashar: a good example how just changing the user name turns into a rabbit hole, but we can do it." [puppet] - 10https://gerrit.wikimedia.org/r/1087950 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:13:40] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296938 (10Papaul) @ssingh thanks i forgot about the 13th I update the dates. 
[17:13:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296937 (10Joe) I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're doing a datacenter switchover? Because oth... [17:14:34] (03CR) 10Daimona Eaytoy: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [17:14:37] (03PS1) 10Majavah: WMCS: Lookup IPv6 records more generally [puppet] - 10https://gerrit.wikimedia.org/r/1087951 [17:14:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [17:15:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:17:12] !log importing debs for mercurius-1.0.1 [17:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:18:56] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296958 (10akosiaris) > Upgrades should follow the standard process The standard process docs are outdated I fear. > Depool site (optional) > (optional) if codfw, drain mw traff... [17:18:57] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296959 (10Papaul) >>! In T364092#10296937, @Joe wrote: > I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're do... 
[17:18:58] RESOLVED: ProbeDown: Service ganeti1014:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:34] (03CR) 10Jcrespo: [C:03+1] "No issue on my side, looks fine, but please check/CC people if it could affect non production (beta or cloud envs)." [puppet] - 10https://gerrit.wikimedia.org/r/1087950 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:22:01] (03CR) 10Dzahn: "thank you! it will either be non-existing or be larger than 0 bytes, that is for sure since at least some config files will always be in i" [puppet] - 10https://gerrit.wikimedia.org/r/1087950 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:22:02] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296980 (10Papaul) Thanks @akosiaris @Joe we can hold back on codfw for now and work on eqiad. when we switch back to eqiad we can schedule the upgrade for codfw. [17:22:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10296985 (10Papaul) [17:27:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host mc-gp2006.codfw.wmnet with OS bookworm [17:27:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10297015 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm [17:28:03] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10297017 (10akosiaris) >>! In T364092#10296980, @Papaul wrote: > Thanks @akosiaris @Joe we can hold back on codfw for now and work on eqiad. 
when we switch back to eqiad we can sche... [17:29:56] (03CR) 10Dzahn: [C:03+2] backup: add /var/lib/gerrit to gerrit repodata backup filesets [puppet] - 10https://gerrit.wikimedia.org/r/1087950 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:30:46] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10297026 (10cmooney) >>! In T364092#10296958, @akosiaris wrote: > codfw will be the primary during that set of dates, it should NOT be depooled. Agreed. It should also be possible... [17:31:50] jouncebot: next [17:31:50] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1800) [17:32:08] (03CR) 10JHathaway: [C:03+1] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes [puppet] - 10https://gerrit.wikimedia.org/r/1087949 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [17:32:51] (03CR) 10MVernon: [C:03+2] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes [puppet] - 10https://gerrit.wikimedia.org/r/1087949 (https://phabricator.wikimedia.org/T371400) (owner: 10MVernon) [17:33:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297053 (10cmooney) @Jclark-ctr could you also let me know what ports on the fmsw these two were plugged into? |Device 1|Fron... [17:35:15] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [17:37:48] (03CR) 10Btullis: "Looks good. There are a couple of files that I think we could probably exclude and some settings that seem extraneous, but nothing big." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [17:44:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [17:45:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2006.codfw.wmnet with reason: host reimage [17:48:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2006.codfw.wmnet with reason: host reimage [17:58:39] (03CR) 10Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/1087951/4459/" [puppet] - 10https://gerrit.wikimedia.org/r/1087951 (owner: 10Majavah) [17:59:17] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch. Taking the liberty to merge it." [puppet] - 10https://gerrit.wikimedia.org/r/1087508 (owner: 10Muehlenhoff) [17:59:22] (03CR) 10Ssingh: [C:03+2] dns::auth::update: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1087508 (owner: 10Muehlenhoff) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1800) [18:03:00] !log dummy authdns-update to test CR 10857508 [18:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:06:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10297264 (10VRiley-WMF) [18:10:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [18:10:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host ms-be2083.codfw.wmnet with OS bullseye [18:11:15] (03PS1) 10BCornwall: varnish: Increase RSA cert warnings to 5% of views [puppet] - 10https://gerrit.wikimedia.org/r/1087954 (https://phabricator.wikimedia.org/T370837) [18:11:52] (03CR) 10Ssingh: [C:03+1] varnish: Increase RSA cert warnings to 5% of views [puppet] - 10https://gerrit.wikimedia.org/r/1087954 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:17:26] (03CR) 10BCornwall: [V:03+2 C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1087954 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:19:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297284 (10phaultfinder) [18:25:38] (03CR) 10Ssingh: [C:03+1] "Hardening looks good. [nit] link bug T378341." [puppet] - 10https://gerrit.wikimedia.org/r/1087944 (owner: 10Vgutierrez) [18:25:53] (03CR) 10Ssingh: [C:03+1] "[nit] add Bug: T378341" [puppet] - 10https://gerrit.wikimedia.org/r/1087941 (owner: 10Vgutierrez) [18:25:56] (03CR) 10Ssingh: [C:03+1] "[nit] add Bug: T378341" [puppet] - 10https://gerrit.wikimedia.org/r/1087939 (owner: 10Vgutierrez) [18:28:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297301 (10cmooney) >>! In T377381#10250655, @Jgreen wrote: > There are 6 servers being replaced: > {T369565} > {T369947} > {T... 
[18:29:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [18:31:02] (03PS1) 10Scott French: shellbox-syntaxhighlight: 1 of 12 replicas on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087579 (https://phabricator.wikimedia.org/T377038) [18:34:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297318 (10phaultfinder) [18:39:32] (03CR) 10BCornwall: [C:03+2] idp: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075608 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:39:42] (03PS2) 10BCornwall: idp: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075608 (https://phabricator.wikimedia.org/T375569) [18:39:48] (03CR) 10BCornwall: [V:03+2 C:03+2] idp: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075608 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:41:27] !log Remove RSA cert support from P:idp clients (icinga, karma, klaxon, librenms, orchestrator) (T375569) [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:31] T375569: Remove RSA certificates from puppet - https://phabricator.wikimedia.org/T375569 [18:47:30] (03CR) 10BCornwall: "Marking unresolved." [puppet] - 10https://gerrit.wikimedia.org/r/1075604 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:50:59] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10297394 (10BCornwall) Hi, @violetwtf, has Thomas responded? Thanks for getting on this. :) [19:00:04] jnuche and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T1900). 
[19:06:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) (owner: 10Scardenasmolinar) [19:07:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) (owner: 10Scardenasmolinar) [19:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297470 (10phaultfinder) [19:21:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297478 (10cmooney) All, just to be aware I hit another snag this evening which may be problematic. When trying to configure... [19:23:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10297480 (10wiki_willy) Just a heads up @Jclark-ctr & @VRiley-WMF - the test controller kit should've arrived yesterday: https://www.fedex.com/fedextr... 
[19:26:03] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10297496 (10wiki_willy) Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday: https://www.fedex.com/... [19:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297520 (10phaultfinder) [19:41:42] (03PS1) 10Dzahn: devtools: update gerrit user from gerrit2 to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1087963 (https://phabricator.wikimedia.org/T338470) [19:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297558 (10phaultfinder) [19:55:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:55:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2006.codfw.wmnet with OS bookworm [19:57:29] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10297564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host mc-gp2006.codfw.wmnet with OS bookworm completed: - mc-gp2006 (**WARN**... 
[19:57:44] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10297568 (10Jhancock.wm) [19:58:46] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10297569 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert this is ready for y'all [20:02:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297575 (10Jhancock.wm) [20:09:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297583 (10Dwisehaupt) > Thanks @Jgreen . Looking at the existing ports on the switch I think it might make sense if we chang... [20:16:51] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:19:00] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:22:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297595 (10Jclark-ctr) @cmooney replaced 1g dac cables with sfpt and cat6 cables. These two switches have been removed from... 
[20:24:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2146-50 to codfw - jhancock@cumin2002" [20:25:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2146-50 to codfw - jhancock@cumin2002" [20:25:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:25:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2146.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2147.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2148.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2149.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:25:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2150.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:26:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2149.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:27:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2149.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:32:42] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for 
cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10297608 (10Jclark-ctr) 05Open→03Resolved @fnegri cpu2 and mainboard where replaced today [20:35:25] (03PS1) 10Dzahn: gerrit: add chown parameter to lfs data rsync, ensure daemon_user is used [puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) [20:36:08] (03PS2) 10Dzahn: devtools: update gerrit user from gerrit2 to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1087963 (https://phabricator.wikimedia.org/T338470) [20:36:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2147.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:36:43] (03CR) 10Dzahn: [C:03+2] "cloud only and VM is currently down" [puppet] - 10https://gerrit.wikimedia.org/r/1087963 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [20:36:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2148.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:36:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2150.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2146.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:38] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10297622 (10Eevans) Ok, I think I'm going to be bold and say we just move forward with the version of the bot that requires you poke/nick highlight it. Bots that work this way are actually pretty... 
[20:37:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2149.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:38:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2146'] [20:38:40] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2147'] [20:38:41] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2148'] [20:38:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2149'] [20:38:44] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2150'] [20:39:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2146'] [20:39:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2147'] [20:39:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2148'] [20:39:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2149'] [20:39:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2150'] [20:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297623 (10phaultfinder) [20:40:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2146.codfw.wmnet with OS bookworm [20:40:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2147.codfw.wmnet with OS bookworm [20:40:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2149.codfw.wmnet with OS bookworm [20:40:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2150.codfw.wmnet with OS bookworm [20:40:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2148.codfw.wmnet with OS bookworm [20:40:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297624 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2146.codfw.wmnet with O... [20:40:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2147.codfw.wmnet with O... [20:40:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297626 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2149.codfw.wmnet with O... [20:40:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2150.codfw.wmnet with O... 
[20:40:36] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2148.codfw.wmnet with O... [20:44:47] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1087967/4462/" [puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [20:45:58] (03PS1) 10Raymond Ndibe: openstack: keystone: fix radosgw 500 errors with Object Storage [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) [20:46:43] (03CR) 10Dzahn: [V:03+1 C:03+1] "tested by manually running the rsync command and does the job as it should. files on gerrit2003 will be owned gerrit:gerrit but on gerrit2" [puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [20:50:33] (03CR) 10Raymond Ndibe: openstack: keystone: fix radosgw 500 errors with Object Storage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [20:53:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10297670 (10Jclark-ctr) @MoritzMuehlenhoff these where not handed over to service owner while. Luca and dceng researched License /provisioning issue. T... [20:53:31] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10297671 (10jhathaway) sounds good to me! [20:54:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10297675 (10cmooney) >>! 
In T377381#10297595, @Jclark-ctr wrote: > These two switches have been removed from racks. > asw2-d5-e... [20:57:46] (03CR) 10Raymond Ndibe: "Also someone should Add `profile::openstack::base::keystone::credential_key_0` to the cloudcontrol nodes (`1005`, `1006`, `1007`). I don't" [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [20:58:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage [20:58:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage [20:58:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage [20:59:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage [20:59:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage [20:59:32] (03PS3) 10Scott French: Add title-case mapping to support migration to PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T2100) [21:00:04] No Gerrit patches in the queue for this window AFAICS. 
[21:01:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage [21:05:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage [21:05:43] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10297699 (10Eevans) [21:06:04] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10297700 (10Eevans) Done: [[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/445e00ef3be5e47c73592122434f6e175b02f994 | corto/-/commit/445e00e ]] [21:06:18] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10297693 (10Eevans) 05Open→03Resolved a:03Eevans Done: [[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/445e00ef3be5e47c73592122434f6e175b02f994 | corto/-/commit/445e00e ]] [21:07:35] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10297701 (10Eevans) 05Open→03Resolved a:03Eevans [21:08:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage [21:08:44] (03CR) 10Scott French: "Tim, Timo - it would be great to get your review on this. 
All of this "makes sense" from a purely technical perspective (i.e., in terms of" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [21:12:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage [21:12:52] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet [reason: PSU replaced] [21:16:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage [21:18:14] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:19:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10297756 (10VRiley-WMF) We have recieved the test controller kit. We are ready to install it whenever you're ready! [21:20:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:20:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:22:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10297762 (10jcrespo) >>! In T371416#10276034, @jcrespo wrote: > Feel free to use and replace/service backup1012 and backup2012 as you want. 
[21:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10297766 (10phaultfinder) [21:25:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:25:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10297770 (10Jgreen) [21:26:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:26:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2149.codfw.wmnet with OS bookworm [21:26:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:26:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2146.codfw.wmnet with OS bookworm [21:26:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297771 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2149.codfw.wmnet with OS bo... [21:27:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2146.codfw.wmnet with OS bo... 
[21:27:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10297773 (10Jgreen) [21:27:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:27:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:27:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2147.codfw.wmnet with OS bookworm [21:27:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2147.codfw.wmnet with OS bo... [21:31:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:31:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:31:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2148.codfw.wmnet with OS bookworm [21:31:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2148.codfw.wmnet with OS bo... 
[21:32:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10297791 (10Jgreen) [21:35:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:42:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:42:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2150.codfw.wmnet with OS bookworm [21:43:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297808 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2150.codfw.wmnet with OS bo... 
[21:45:29] FIRING: SystemdUnitFailed: uwsgi-netbox.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:42] FIRING: JobUnavailable: Reduced availability for job netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:20] FIRING: CirrusSearchPoolCounterRejectionTooHigh: MediaWiki CirrusSearch failing to obtain a token from the pool counter at a very high rate - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Pool_Counter_rejections_(search_is_currently_too_busy) - https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchPoolCounterRejectionToo [21:46:32] FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [21:46:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:46:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:47:10] !incidents 
[21:47:11] 5376 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [21:47:11] 5377 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [21:47:11] 5375 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:11] 5374 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:11] 5373 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:12] 5372 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:12] 5371 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:12] 5370 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:13] 5369 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:13] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:14] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [21:47:19] !ack 5376 [21:47:19] 5376 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [21:47:25] !ack 5377 [21:47:26] 5377 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [21:47:31] Here. 
[21:48:24] I'm officially out sick but nearby, looking [21:50:25] FIRING: [7x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:20] RESOLVED: CirrusSearchPoolCounterRejectionTooHigh: MediaWiki CirrusSearch failing to obtain a token from the pool counter at a very high rate - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Pool_Counter_rejections_(search_is_currently_too_busy) - https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchPoolCounterRejectionT [21:51:32] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [21:51:44] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:51:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:55:25] FIRING: [9x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241106T2200) [22:00:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:26] (03PS1) 10Ebernhardson: search: Exclude automated pool counter from alerts [alerts] - 10https://gerrit.wikimedia.org/r/1087973 [22:03:13] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:03:14] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:03:40] !incidents [22:03:40] 5378 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [22:03:41] 5379 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:03:41] 5377 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:03:41] 5376 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [22:03:41] 5375 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:41] 5374 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:42] 5373 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:42] 5372 (RESOLVED) Primary outbound port utilisation over 80% 
(paged) global noc (cr1-eqiad.wikimedia.org) [22:03:43] 5371 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:43] 5370 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:44] 5369 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:44] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:45] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:03:49] !ack 5378 [22:03:49] 5378 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [22:03:54] !ack 5379 [22:03:54] 5379 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:04:26] FIRING: CirrusSearchPoolCounterRejectionTooHigh: MediaWiki CirrusSearch failing to obtain a token from the pool counter at a very high rate - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Pool_Counter_rejections_(search_is_currently_too_busy) - https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchPoolCounterRejectionToo [22:04:51] FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [22:05:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:42] RESOLVED: JobUnavailable: Reduced availability for job 
netbox_global in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:08:13] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:08:14] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:09:20] RESOLVED: CirrusSearchPoolCounterRejectionTooHigh: MediaWiki CirrusSearch failing to obtain a token from the pool counter at a very high rate - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Pool_Counter_rejections_(search_is_currently_too_busy) - https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchPoolCounterRejectionT [22:09:51] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [22:10:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:32] !log jclark@cumin1002 START - Cookbook 
sre.dns.netbox [22:11:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297878 (10Jhancock.wm) [22:14:20] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for mc-gp1004 - jclark@cumin1002" [22:14:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for mc-gp1004 - jclark@cumin1002" [22:14:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:15:25] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host mc-gp1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:16:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host mc-gp1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:16:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host mc-gp1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:16:58] !incidents [22:16:59] 5379 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:16:59] 5378 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [22:16:59] 5377 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:16:59] 5376 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [22:16:59] 5375 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc 
(cr1-eqiad.wikimedia.org) [22:17:00] 5374 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:00] 5373 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:00] 5372 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:01] 5371 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:01] 5370 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:02] 5369 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:02] 5368 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:17:03] 5367 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [22:18:12] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:18:34] (03PS1) 10Bvibber: DB config for testcommonswiki deployment for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) [22:19:16] (03CR) 10CI reject: [V:04-1] DB config for testcommonswiki deployment for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) (owner: 10Bvibber) [22:20:19] (03PS2) 10Bvibber: DB config for testcommonswiki deployment for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) [22:20:25] RESOLVED: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2151-55 to codfw - jhancock@cumin2002" [22:22:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2151-55 to codfw - jhancock@cumin2002" [22:22:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:23:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2151.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2152.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2153.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2154.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2155.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2153.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2155.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2153.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with 
Dell SCP reboot policy FORCED [22:25:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10297909 (10Jclark-ctr) [22:25:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2155.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10297911 (10VRiley-WMF) [22:33:54] (03PS1) 10Eevans: Add (fake) corto bot password [labs/private] - 10https://gerrit.wikimedia.org/r/1087979 (https://phabricator.wikimedia.org/T379204) [22:33:59] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: corto: update production deployment for project changes - https://phabricator.wikimedia.org/T379204 (10Eevans) 03NEW [22:34:17] (03CR) 10Eevans: [V:03+2 C:03+2] Add (fake) corto bot password [labs/private] - 10https://gerrit.wikimedia.org/r/1087979 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [22:35:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2154.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2151.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2152.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:36:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2155.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:36:19] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2153.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2155'] [22:38:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2154'] [22:38:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2153'] [22:38:02] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2152'] [22:38:03] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2151'] [22:38:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2151'] [22:38:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2152'] [22:38:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2153'] [22:38:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2154'] [22:38:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2155'] [22:38:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:38:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with 
Dell SCP reboot policy FORCED [22:39:12] (03PS1) 10Eevans: Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) [22:39:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2151.codfw.wmnet with OS bookworm [22:39:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2152.codfw.wmnet with OS bookworm [22:39:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2153.codfw.wmnet with OS bookworm [22:39:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2154.codfw.wmnet with OS bookworm [22:39:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2155.codfw.wmnet with OS bookworm [22:39:48] (03CR) 10CI reject: [V:04-1] Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [22:39:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2151.codfw.wmnet with O... [22:39:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2152.codfw.wmnet with O... [22:39:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297956 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2153.codfw.wmnet with O... 
[22:40:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297957 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2154.codfw.wmnet with O... [22:40:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10297958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2155.codfw.wmnet with O... [22:40:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-gp1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:43:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1006.eqiad.wmnet with OS bookworm [22:44:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10297962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host mc-gp1006.eqiad.wmnet with OS bookworm [22:44:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1005.eqiad.wmnet with OS bookworm [22:44:05] (03PS2) 10Eevans: Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) [22:44:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10297964 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host mc-gp1005.eqiad.wmnet with OS bookworm [22:44:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host mc-gp1004.eqiad.wmnet with OS bookworm [22:44:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install 
mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10297965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host mc-gp1004.eqiad.wmnet with OS bookworm [22:46:13] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [22:57:05] (03CR) 10Bking: [C:03+2] search: Exclude automated pool counter from alerts [alerts] - 10https://gerrit.wikimedia.org/r/1087973 (owner: 10Ebernhardson) [22:58:03] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [22:58:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage [22:58:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: host reimage [22:58:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage [22:58:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage [22:58:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage [23:00:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1006.eqiad.wmnet with reason: host reimage [23:00:40] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1004.eqiad.wmnet with reason: host reimage [23:02:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage [23:02:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1005.eqiad.wmnet with reason: host reimage [23:05:27] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1004.eqiad.wmnet with reason: host reimage [23:08:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage [23:12:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1005.eqiad.wmnet with reason: host reimage [23:15:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: host reimage [23:18:55] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:19:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage [23:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:41] (03CR) 10Scott French: [C:03+2] "Thanks for the clean-up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087606 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [23:22:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1006.eqiad.wmnet with reason: host reimage [23:23:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:23:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2155.codfw.wmnet with OS bookworm [23:23:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:23:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298075 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2155.codfw.wmnet with OS bo... [23:23:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:23:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1004.eqiad.wmnet with OS bookworm [23:24:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10298076 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host mc-gp1004.eqiad.wmnet with OS bookworm completed: - mc-gp1004 (**PASS**) -... 
[23:25:12] (PS1) Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127)
[23:25:17] (CR) Eevans: "I'm not sure what is causing the PCC failures. I did add `profile::corto::irc_config::password` to labs/private, but it doesn't seem to f" [puppet] - https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: Eevans)
[23:25:54] (CR) CI reject: [V:-1] Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: Aude)
[23:26:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage
[23:27:57] (PS2) Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127)
[23:27:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:28:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:28:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2153.codfw.wmnet with OS bookworm
[23:29:10] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298086 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2153.codfw.wmnet with OS bo...
[23:30:03] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:31:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:31:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1005.eqiad.wmnet with OS bookworm
[23:31:39] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10298090 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host mc-gp1005.eqiad.wmnet with OS bookworm completed: - mc-gp1005 (**PASS**) -...
[23:34:05] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:36:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:36:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2154.codfw.wmnet with OS bookworm
[23:36:14] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298098 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2154.codfw.wmnet with OS bo...
[23:37:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:39:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:39:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2151.codfw.wmnet with OS bookworm
[23:39:57] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2151.codfw.wmnet with OS bo...
[23:41:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:41:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:41:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1006.eqiad.wmnet with OS bookworm
[23:41:47] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10298120 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host mc-gp1006.eqiad.wmnet with OS bookworm completed: - mc-gp1006 (**PASS**) -...
[23:45:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:46:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:46:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2152.codfw.wmnet with OS bookworm
[23:46:25] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2152.codfw.wmnet with OS bo...
[23:46:39] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10298136 (Jhancock.wm)
[23:50:54] (CR) Jdlrobson: Enable Chart extension on testwiki and testcommonswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: Aude)
[23:54:11] (PS3) Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127)
[23:54:54] (CR) CI reject: [V:-1] Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: Aude)
[23:55:01] (CR) Aude: Enable Chart extension on testwiki and testcommonswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: Aude)