[01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T0100) [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:03:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:04:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:04:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:05:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.20 [core] (wmf/1.39.0-wmf.20) - 10https://gerrit.wikimedia.org/r/812976 [02:07:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.20 [core] (wmf/1.39.0-wmf.20) - 10https://gerrit.wikimedia.org/r/812976 (owner: 10TrainBranchBot) [02:25:16] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.20 [core] (wmf/1.39.0-wmf.20) - 10https://gerrit.wikimedia.org/r/812976 (owner: 10TrainBranchBot) [02:31:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:33:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:33:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:34:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:14:26] (03CR) 10Varnent: [C: 03+1] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [04:16:13] (03PS7) 10Varnent: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [04:24:21] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T312827 (10Bethany) [04:24:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Bethany) [05:18:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 20 hosts with reason: Primary switchover s3 T311610 [05:19:05] T311610: Switchover s3 master db1123 -> db1157 - https://phabricator.wikimedia.org/T311610 [05:19:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 20 hosts with reason: Primary switchover s3 T311610 [05:19:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1157 with weight 0 
T311610', diff saved to https://phabricator.wikimedia.org/P31007 and previous config saved to /var/cache/conftool/dbconfig/20220712-051927-root.json [05:23:39] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Marostegui) Thanks! [05:30:24] PROBLEM - puppet last run on elastic2049 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:35:21] (03PS3) 10Marostegui: mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) [05:36:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Switchover s3 master db1123 -> db1157 [puppet] - 10https://gerrit.wikimedia.org/r/812841 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [05:38:12] (03PS1) 10Marostegui: db1123: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813117 (https://phabricator.wikimedia.org/T311610) [05:59:04] (03CR) 10Ladsgroup: [C: 03+1] db1123: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813117 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [06:00:04] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T0600). [06:00:09] o/ [06:00:12] o/ [06:00:17] !log Starting s3 eqiad failover from db1123 to db1157 - T311610 [06:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:22] T311610: Switchover s3 master db1123 -> db1157 - https://phabricator.wikimedia.org/T311610 [06:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T311610', diff saved to https://phabricator.wikimedia.org/P31008 and previous config saved to /var/cache/conftool/dbconfig/20220712-060031-marostegui.json [06:00:47] read only now [06:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1157 to s3 primary and set section read-write T311610', diff saved to https://phabricator.wikimedia.org/P31009 and previous config saved to /var/cache/conftool/dbconfig/20220712-060058-marostegui.json [06:01:02] all done [06:01:08] I can edit [06:01:15] yup, I can edit too [06:01:23] \o/ [06:02:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s3-master [dns] - 10https://gerrit.wikimedia.org/r/812822 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [06:02:41] (03CR) 10Marostegui: [C: 03+2] db1123: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813117 (https://phabricator.wikimedia.org/T311610) (owner: 10Marostegui) [06:04:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 T311610', diff saved to https://phabricator.wikimedia.org/P31010 and previous config saved to /var/cache/conftool/dbconfig/20220712-060407-root.json [06:07:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10MNovotny_WMF) Approved [06:12:46] !log dbmaint s3@eqiad T310011 [06:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:49] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:13:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:13:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:23:09] (03PS1) 10Marostegui: Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/812951 [06:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31011 and previous config saved to /var/cache/conftool/dbconfig/20220712-062344-root.json [06:23:53] (03CR) 10Marostegui: [C: 03+2] Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/812951 (owner: 10Marostegui) [06:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31012 and previous config saved to /var/cache/conftool/dbconfig/20220712-063848-root.json [06:52:32] (03PS1) 10Marostegui: db2163: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813189 (https://phabricator.wikimedia.org/T311493) [06:53:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31013 and previous config saved to /var/cache/conftool/dbconfig/20220712-065352-root.json [06:54:06] (03CR) 10Marostegui: [C: 03+2] db2163: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813189 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123', diff saved to https://phabricator.wikimedia.org/P31014 and previous config saved to /var/cache/conftool/dbconfig/20220712-070240-root.json [07:08:41] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@89cb17d]: subgraph_and_query_mapping: Increase executor memory to 12g, use repartition [07:10:43] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@89cb17d]: subgraph_and_query_mapping: Increase executor memory to 12g, use repartition (duration: 02m 02s) [07:56:23] (03Abandoned) 10Abijeet Patro: WikiPage group description: prefix source page title [extensions/Translate] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812531 (https://phabricator.wikimedia.org/T312688) (owner: 10Abijeet Patro) [08:07:04] (03CR) 10Filippo Giunchedi: "See inline, LGTM tho" [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [08:21:07] is someone able to restart zuul on contint for T309371 ? 
[08:21:08] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 [08:22:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:25:37] !log hashar@deploy1002 Started deploy [integration/docroot@c2cceaf]: Fix NPM URL for Wikimedia language-data library [08:25:46] !log hashar@deploy1002 Finished deploy [integration/docroot@c2cceaf]: Fix NPM URL for Wikimedia language-data library (duration: 00m 08s) [08:28:43] (03PS1) 10David Caro: wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813195 [08:28:52] (03CR) 10CI reject: [V: 04-1] wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813195 (owner: 10David Caro) [08:30:27] (03PS1) 10David Caro: wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813196 [08:30:37] (03CR) 10CI reject: [V: 04-1] wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813196 (owner: 10David Caro) [08:30:50] (03Abandoned) 10David Caro: wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813195 (owner: 10David Caro) [08:32:01] (03PS1) 10David Caro: wmcs: Fix task title template [puppet] - 10https://gerrit.wikimedia.org/r/813197 [08:32:10] (03Abandoned) 10David Caro: wmcs: Fix task title creation [puppet] - 10https://gerrit.wikimedia.org/r/813196 (owner: 10David Caro) [08:32:12] (03CR) 10CI reject: [V: 04-1] wmcs: Fix task title template [puppet] - 10https://gerrit.wikimedia.org/r/813197 (owner: 10David Caro) [08:58:06] !log Restarted Gerrit T309371 [08:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:09] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 [08:58:10] which did not flush anything [09:09:30] (03PS1) 10Volans: CI: fix reported issues [software/cumin] - 10https://gerrit.wikimedia.org/r/813201 [09:10:00] (03Abandoned) 10Volans: doc: set default language [software/cumin] - 10https://gerrit.wikimedia.org/r/801389 (owner: 10Volans) [09:12:29] !log Restarted Zuul T309371 [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:32] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 [09:29:58] (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch: remove icinga::monitor::elasticsearch::old_jvm_gc_checks [puppet] - 10https://gerrit.wikimedia.org/r/812860 (https://phabricator.wikimedia.org/T288622) (owner: 10Filippo Giunchedi) [09:30:10] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: switch to prometheus-only probes for commons [puppet] - 10https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:30:19] (03PS2) 10Filippo Giunchedi: icinga: switch to prometheus-only probes for commons [puppet] - 10https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847) [09:32:30] (03CR) 10Filippo Giunchedi: [V: 03+2] icinga: switch to prometheus-only probes for commons [puppet] - 10https://gerrit.wikimedia.org/r/812854 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:32:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36254/console" [puppet] - 10https://gerrit.wikimedia.org/r/809224 
(owner: 10Jbond) [09:34:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36255/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:35:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36256/console" [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:38:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:38:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:40:03] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10akosiaris) 05Resolved→03Open >>! In T304289#8060511, @akosiaris wrote: > @Cmjohnson Could you please add some more information why mgmt flapping will be an ongoing issue ? > > Also,... [09:40:35] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10akosiaris) Note that wtp10XX hosts will be resolved by T307220. [09:40:47] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM, there is a whitespace change but i think the new version is correct." [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:44:24] (03PS2) 10Jbond: lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:45:01] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:52:48] (03CR) 10Jbond: "lgtm nits and questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [09:56:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/813201 (owner: 10Volans) [09:56:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:56:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:57:24] (03CR) 10Jbond: [C: 03+2] spdx: Add csr files to the list of files to ignore. 
[puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [09:57:50] (03CR) 10Jbond: [C: 03+2] mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 (owner: 10Jbond) [09:57:59] (03CR) 10Jbond: [C: 03+2] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:58:20] (03PS6) 10Jbond: mcrouter: update tox configuration [puppet] - 10https://gerrit.wikimedia.org/r/808212 [09:58:28] (03PS8) 10Jbond: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:04:55] (03PS1) 10Marostegui: db2082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813206 (https://phabricator.wikimedia.org/T311475) [10:06:04] (03CR) 10Marostegui: [C: 03+2] db2082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813206 (https://phabricator.wikimedia.org/T311475) (owner: 10Marostegui) [10:09:01] (03PS1) 10Marostegui: mariadb: Productionize db2164 [puppet] - 10https://gerrit.wikimedia.org/r/813207 (https://phabricator.wikimedia.org/T311493) [10:09:55] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2164 [puppet] - 10https://gerrit.wikimedia.org/r/813207 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:12:02] (03PS1) 10Marostegui: db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813208 (https://phabricator.wikimedia.org/T308331) [10:12:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for onsite maintenance T308331', diff saved to https://phabricator.wikimedia.org/P31017 and previous config saved to /var/cache/conftool/dbconfig/20220712-101211-root.json [10:12:15] T308331: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 [10:12:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some weight to x1 master until the replica is back from maintenance', diff saved to https://phabricator.wikimedia.org/P31018 and previous config saved to /var/cache/conftool/dbconfig/20220712-101246-marostegui.json [10:13:00] (03CR) 10Marostegui: [C: 03+2] db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/813208 (https://phabricator.wikimedia.org/T308331) (owner: 10Marostegui) [10:13:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [10:14:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [10:15:35] (03PS2) 10Jbond: cli: Add ability to override th amount of retries and backoffs [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 [10:15:53] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) @Cmjohnson db1137 is now off and ready to be moved anytime. 
[10:18:04] (ProbeDown) firing: (2) Service alert2001:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:10] uhuh [10:18:28] looking, I recently switched that probe to paging [10:18:33] * Emperor here [10:19:06] around, kinda [10:19:21] I think we're fine, new probe and I'll silence it [10:19:24] sorry for the noise [10:19:46] ack [10:20:08] Oh okay [10:23:04] (ProbeDown) firing: (2) Service alert1001:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:15] <_joe_> what is that probe calling? [10:23:47] that's going to page again, sorry in advance [10:24:10] <_joe_> not having the url in the alert makes it hard to figure out immediately what to check [10:25:06] fair enough, I may look into that [10:26:06] even with the url though there's more to the probe, e.g. the check above is looking for a regex [10:28:11] (03PS1) 10Filippo Giunchedi: icinga: fix ip addresses for commons.wikimedia.org probe [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) [10:28:32] ^ will fix, seeking reviewers [10:28:39] (03CR) 10CI reject: [V: 04-1] icinga: fix ip addresses for commons.wikimedia.org probe [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:29:43] (03PS2) 10Filippo Giunchedi: icinga: fix ip addresses for commons.wikimedia.org probe [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) [10:31:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [10:34:28] (03PS3) 10Jbond: wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 [10:34:37] (03CR) 10Jbond: wmflib: create a helper function for querying puppetdb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [10:46:28] jouncebot: now [10:46:28] No deployments scheduled for the next 2 hour(s) and 13 minute(s) [10:48:25] alright, I’ll test some things on mwdebug then [10:48:41] (specifically mwdebug1002) [10:50:29] (03PS1) 10Jbond: P:debmonitor::client: make ensurable [puppet] - 10https://gerrit.wikimedia.org/r/813216 [10:51:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36257/console" [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [10:51:41] (03CR) 10Majavah: P:debmonitor::client: make ensurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [10:55:31] alright, I’m done :) scap pulled on mwdebug1002 to wipe my changes [11:01:26] godog: making DNS resolution on puppet time means that those probes could fail till puppet runs again if we need to depool a DC [11:02:25] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::client: make ensurable [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [11:03:51] (03PS2) 10Jbond: P:debmonitor::client: make ensurable [puppet] - 
10https://gerrit.wikimedia.org/r/813216 [11:06:03] vgutierrez: good point, I think it is a current limitation of the check/implementation, will have to think how to fix it "properly" [11:07:03] I'll open a task to track it, though in the meantime we're running the check against alert hosts which is .. bound to fail ! [11:09:17] (03PS3) 10Jbond: P:debmonitor::client: make ensurable [puppet] - 10https://gerrit.wikimedia.org/r/813216 [11:09:47] it is possible I'm offbase too and the check actually supports this use case already, filing the task anyways [11:09:48] (03CR) 10Jbond: P:debmonitor::client: make ensurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [11:12:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [11:13:15] filed as T312840 [11:13:16] T312840: Better support for blackbox checks against public/frontend endpoints - https://phabricator.wikimedia.org/T312840 [11:13:55] going to lunch [11:14:06] (03CR) 10Jbond: [C: 04-1] "i dont think this will do what you expect" [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:14:18] (03CR) 10Slyngshede: "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:17:47] (03CR) 10Jbond: [C: 03+2] P:debmonitor::client: make ensurable [puppet] - 10https://gerrit.wikimedia.org/r/813216 (owner: 10Jbond) [11:19:34] (03PS19) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [11:22:44] (03CR) 10CI reject: [V: 04-1] beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (owner: 10Jbond) [11:25:01] (03CR) 10Jbond: [C: 03+2] wmflib: create a helper function for querying puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/812904 (owner: 10Jbond) [11:26:59] (03PS2) 10Jbond: wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907 [11:28:43] 10Puppet, 10Infrastructure-Foundations, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10The_RedBurn) There's already a mobile redirect code, with... 
[11:28:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36258/console" [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond) [11:29:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: migrae all calls for puppetdb_query to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/812907 (owner: 10Jbond) [11:29:44] (03PS1) 10Majavah: P:toolforge::prometheus: drop clouddb-services jobs [puppet] - 10https://gerrit.wikimedia.org/r/813219 [11:29:46] (03PS1) 10Majavah: P:toolforge: re-add blackbox monitoring for static [puppet] - 10https://gerrit.wikimedia.org/r/813220 [11:43:32] (03PS2) 10David Caro: wmcs: Fix task title template [puppet] - 10https://gerrit.wikimedia.org/r/813197 [11:48:43] (03CR) 10Filippo Giunchedi: icinga: fix ip addresses for commons.wikimedia.org probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:50:12] (03CR) 10David Caro: [C: 03+2] wmcs: Fix task title template [puppet] - 10https://gerrit.wikimedia.org/r/813197 (owner: 10David Caro) [11:52:11] (03CR) 10David Caro: [C: 03+2] wmcs: Add ceph cluster alerts [alerts] - 10https://gerrit.wikimedia.org/r/812706 (owner: 10David Caro) [11:52:30] (03CR) 10Jbond: [C: 03+1] icinga: fix ip addresses for commons.wikimedia.org probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:53:34] (03CR) 10Jbond: [C: 03+1] icinga: fix ip addresses for commons.wikimedia.org probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:54:56] jbond: nice, I like the text-lb idea [11:55:09] will send a followup now [11:55:33] sgtm :) [11:57:58] (03PS3) 10Filippo Giunchedi: icinga: fix ip addresses for commons.wikimedia.org probe [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) [11:58:00] (03PS11) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [11:58:02] (03PS3) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [11:58:36] (03CR) 10Jbond: [C: 03+1] icinga: fix ip addresses for commons.wikimedia.org probe [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:01:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331 [12:01:59] T308331: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 [12:02:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331 [12:02:18] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: fix ip addresses for commons.wikimedia.org probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813213 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:05:59] (03PS1) 10David Caro: wmcs: remove ceph alerts, replaced by alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813228 [12:10:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) 
[12:13:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [12:13:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [12:13:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [12:16:08] (03CR) 10Jbond: [C: 03+1] k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:16:30] (03CR) 10Jbond: [C: 03+1] k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:35:04] (03PS1) 10Alexandros Kosiaris: configcluster: Turn-off zookeeper version pin [puppet] - 10https://gerrit.wikimedia.org/r/813233 [12:37:47] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr) [12:42:47] (03PS2) 10DDesouza: QuickSurveys: Disable 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [12:43:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) [12:45:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [12:45:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [12:45:07] (03PS1) 10Marostegui: install_server: Allow reimage db1185-db1195 [puppet] - 10https://gerrit.wikimedia.org/r/813234 (https://phabricator.wikimedia.org/T306928) [12:46:26] (03Restored) 10Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788608 (owner: 10Winston Sung) [12:46:30] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow reimage db1185-db1195 [puppet] - 10https://gerrit.wikimedia.org/r/813234 (https://phabricator.wikimedia.org/T306928) (owner: 10Marostegui) [12:46:42] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db1185-db1195 [puppet] - 10https://gerrit.wikimedia.org/r/813234 (https://phabricator.wikimedia.org/T306928) (owner: 10Marostegui) [12:46:50] (03Abandoned) 10Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788608 (owner: 10Winston Sung) [12:47:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) [12:47:38] (03CR) 10MVernon: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/813234 (https://phabricator.wikimedia.org/T306928) (owner: 10Marostegui) [12:49:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) @Marostegui Should we or could we utilize row E/F for these? 
[12:56:09] (03PS1) 10Bartosz Dziewoński: Parse 'DiscussionToolsTimestampFormatSwitchTime' config value as UTC [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812956 (https://phabricator.wikimedia.org/T312828) [12:56:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:16] jouncebot: next [12:58:16] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T1300) [12:58:16] In 0 hour(s) and 1 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T1300) [12:58:21] jouncebot: refresh [12:58:21] I refreshed my knowledge about deployments. [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T1300) [13:01:44] I can’t deploy, sorry [13:09:53] (03CR) 10Jbond: "see inline for comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [13:23:09] (03PS1) 10Hnowlan: image-suggestion: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/813242 (https://phabricator.wikimedia.org/T304885) [13:26:03] okay, if nobody else is around I can try to deploy MatmaRex’ change [13:26:08] should be fairly safe, at least [13:26:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Parse 'DiscussionToolsTimestampFormatSwitchTime' config value as UTC [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812956 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński) [13:27:01] Lucas_WMDE: thanks [13:33:27] (03CR) 10Jbond: Extend custom raid fact to support Perc 750 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [13:35:09] (03Merged) 10jenkins-bot: Parse 'DiscussionToolsTimestampFormatSwitchTime' config value as UTC [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812956 (https://phabricator.wikimedia.org/T312828) (owner: 10Bartosz Dziewoński) [13:35:43] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:35:45] looks like my `git fetch` is now fetching wmf.20 of all extensions [13:36:18] MatmaRex: the fix should be on mwdebug1001, can you test it? 
[13:36:37] yeah [13:37:13] Lucas_WMDE: looks good [13:37:23] ok, thanks [13:37:56] syncing [13:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:39:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:40:42] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/DiscussionTools/modules/CommentItem.js: Backport: [[gerrit:812956|Parse 'DiscussionToolsTimestampFormatSwitchTime' config value as UTC (T312828)]] (duration: 02m 50s) [13:40:47] T312828: "Could not find the comment you're replying to on the page" - https://phabricator.wikimedia.org/T312828 [13:41:17] * Lucas_WMDE still needs to get used to stashbot no longer acknowledging logmsgbot [13:41:37] !log UTC afternoon backport window done [13:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:50] thanks [13:45:54] hm, there are a lot more OOM errors since 13:33 UTC [13:46:20] (my scap didn’t start until 13:37:51; mwdebug scap pull was at 13:36:09) [13:47:33] anyone have an idea what’s going on? not specific to a single wiki or host (though enwiki and enwiktionary seem especially affected) [13:48:30] mainly parsoid hosts, it seems [13:56:04] (03PS20) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [13:56:06] (03PS1) 10Jbond: rake - spdx: fix extname [puppet] - 10https://gerrit.wikimedia.org/r/813244 [13:57:49] (the OOMs seem to have recovered again, fwiw) [13:59:38] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet [14:00:16] PROBLEM - MariaDB Replica Lag: x2 #page on db2142 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:00:49] I'm gonna ack it in VO [14:01:37] Amir1: known issue? [14:01:44] marostegui: Should we turn off the alert [14:01:52] jhathaway: somewhat [14:02:20] here [14:02:21] Krinkle: this is now paging SREs [14:02:36] it should recover on its own but let me double check [14:02:54] I'm here but I knew what it was when I read it :-( [14:03:05] Amir1: i guess yeah [14:03:44] ok, assuming this is mostly-handled, will keep on eye on here [14:03:49] *one [14:04:37] marostegui: all of codfw is lagging, is it because it's choking on large transactions? [14:04:53] to summarize for others, this is supposed to be active-active, hence the page, but I think it is not yet in use [14:05:15] or semi-sync :D [14:05:34] jynus: this is one of first steps towards active/active so I think it's important that codfw won't be lagging this much [14:05:43] but maybe not paging right now?
[14:05:50] (03CR) 10JHathaway: lists: convert apache template to epp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [14:05:52] At least until the issue is fixed [14:06:08] that's up to you, I guess [14:07:16] okay, I'm going to disable x2 notification for now [14:07:23] (03PS3) 10BCornwall: varnish: Port over traffic_drop from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) [14:07:27] I would definitely delay the enabling of active-active [14:07:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) @jclark-ctr we don't have any preference as long as they go to different racks and they are on the private VLAN (as any other db* host) [14:07:35] under the current situation [14:08:07] it was clear last night it was close to saturation [14:08:18] (03CR) 10BCornwall: varnish: Port over traffic_drop from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:08:34] got later better because lower load, but now load increased back again (following traffic) [14:08:37] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2037.codfw.wmnet [14:09:42] (03CR) 10David Caro: wmcs: vps: remove_instance: add support for puppet deactivation (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801784 (owner: 10Majavah) [14:11:53] (here too, was in an interview) [14:12:05] I take it we're ok tho? [14:12:33] (03CR) 10Muehlenhoff: Extend custom raid fact to support Perc 750 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [14:13:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/813228 (owner: 10David Caro) [14:13:52] (03PS11) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 [14:13:54] (03PS4) 10David Caro: openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854 [14:13:56] (03PS6) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [14:13:58] (03PS6) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [14:14:00] (03PS2) 10David Caro: ceph: add alert handling to ceph custer downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812900 [14:14:02] (03PS2) 10David Caro: wmcs: use run_* instead of run_sync/run_async [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812901 [14:14:04] (03PS2) 10David Caro: toolforge.grid.get_cluster_status: show extended queue info [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812902 [14:14:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! nicely done!" 
[alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:14:29] (03CR) 10BCornwall: [C: 03+2] varnish: Port over traffic_drop from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:14:33] (03PS4) 10BCornwall: varnish: Port over traffic_drop from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812424 (https://phabricator.wikimedia.org/T300723) [14:15:11] I resolved it [14:15:25] I hope it doesn't alert again [14:15:30] it's being downtimed [14:15:31] neat, thanks Amir1 ! [14:16:09] if you are looking for inspiration this week, the alerts could live in alertmanager/prometheus nowadays [14:16:13] They are now dowtimed only for codfw, but we need the long term solution to get rid of that lag [14:16:14] * godog runs after the self-plug [14:16:15] Krinkle: ^ [14:18:08] (03PS1) 10Marostegui: mariadb: Productionize db2162 [puppet] - 10https://gerrit.wikimedia.org/r/813246 (https://phabricator.wikimedia.org/T311493) [14:19:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2162 [puppet] - 10https://gerrit.wikimedia.org/r/813246 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [14:20:03] (03PS2) 10JHathaway: lists: convert apache template to epp [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) [14:20:05] (03PS3) 10JHathaway: lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) [14:23:08] godog: my list of stuff to do is way too long :( [14:23:57] (03CR) 10David Caro: [C: 03+2] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [14:24:06] (03CR) 10David Caro: [C: 03+2] openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854 (owner: 10David Caro) [14:24:46] Amir1: heheh yeah I can imagine! I was being a little bit facetious too (but not that much!), thanks for bearing with me [14:25:37] (03CR) 10David Caro: [C: 03+2] wmcs: remove ceph alerts, replaced by alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813228 (owner: 10David Caro) [14:26:10] (03CR) 10JHathaway: lists: add apache security configs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [14:26:58] (03PS2) 10Nskaggs: Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 [14:27:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:27:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:28:19] (03PS3) 10Nskaggs: Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 [14:28:27] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) @RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry I am beginning to move servers in a few minutes, please ping me in IRC if you have any questions. 
[14:28:45] (03CR) 10Nskaggs: Force depends so setup.py install works (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [14:29:22] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: drop clouddb-services jobs [puppet] - 10https://gerrit.wikimedia.org/r/813219 (owner: 10Majavah) [14:29:42] (03CR) 10David Caro: [C: 03+2] P:toolforge: re-add blackbox monitoring for static [puppet] - 10https://gerrit.wikimedia.org/r/813220 (owner: 10Majavah) [14:30:37] !log ongoing PDU maintenance in rack A5 [14:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:50] (03Merged) 10jenkins-bot: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 (owner: 10David Caro) [14:30:52] (03Merged) 10jenkins-bot: openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854 (owner: 10David Caro) [14:32:15] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [14:32:25] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [14:33:14] (03PS4) 10David Caro: Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [14:35:29] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10TK-999) Just wanted to chime in with something that may be of interest - we have been doing some URL normalization (namely, rewriting `/... 
[14:36:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:36:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:41:12] (03CR) 10David Caro: "Got a non-blocking comment, let me know if/when you want me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [14:44:13] (03CR) 10Jbond: [C: 03+2] rake - spdx: fix extname [puppet] - 10https://gerrit.wikimedia.org/r/813244 (owner: 10Jbond) [14:44:13] PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:45:03] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:46:50] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis [14:46:50] (03CR) 10David Caro: [C: 03+2] Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [14:46:54] T308331: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 [14:47:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis [14:47:43] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2037.codfw.wmnet with OS bullseye [14:47:52] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... [14:48:07] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [14:48:15] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2037.codfw.wmnet with OS bullseye [14:48:15] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [14:48:28] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... [14:48:40] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10BTullis) Thanks @Cmjohnson - I've added 3 hours of downtime for druid1008 - but feel free to add more if appropriate. 
[14:52:18] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [14:52:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2037.codfw.wmnet with OS bullseye [14:52:28] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [14:52:37] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... [14:54:50] (03Merged) 10jenkins-bot: Force depends so setup.py install works [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812889 (owner: 10Nskaggs) [14:55:08] (03PS21) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [14:56:54] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [14:57:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2037.codfw.wmnet with OS bullseye [14:57:03] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [14:57:21] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... 
[15:01:08] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [15:01:17] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [15:01:22] PROBLEM - IPMI Sensor Status on mw2410 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:02:04] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:02:56] PROBLEM - IPMI Sensor Status on db2154 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:03:46] PROBLEM - IPMI Sensor Status on wdqs2003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:12] PROBLEM - IPMI Sensor Status on db2145 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:16] PROBLEM - IPMI Sensor Status on db2085 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:17] PROBLEM - IPMI Sensor Status on mw2408 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:17] PROBLEM - IPMI Sensor Status on mw2409 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:05:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:01] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2037.codfw.wmnet with OS bullseye [15:06:07] (03PS1) 10Jbond: P:aptrepo: install python3-apt required by reprepro-import-updates-keys [puppet] - 10https://gerrit.wikimedia.org/r/813251 [15:06:09] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... 
[15:06:20] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [15:06:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2037.codfw.wmnet with OS bullseye [15:06:28] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [15:06:35] (03PS2) 10Jbond: P:aptrepo: install python3-apt required by reprepro-import-updates-keys [puppet] - 10https://gerrit.wikimedia.org/r/813251 [15:06:37] (03PS22) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [15:06:42] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... [15:06:45] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [15:06:54] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [15:10:36] PROBLEM - IPMI Sensor Status on kubernetes2018 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:06] PROBLEM - IPMI Sensor Status on kubernetes2019 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:18] PROBLEM - IPMI Sensor Status on elastic2025 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:20] PROBLEM - IPMI Sensor Status on ganeti2024 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:20] PROBLEM - IPMI Sensor Status on parse2002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:20] PROBLEM - IPMI Sensor Status on maps2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:38] PROBLEM - IPMI Sensor Status on puppetmaster2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:12:02] RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:13:10] PROBLEM - IPMI Sensor Status on mw2403 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:14:18] PROBLEM - IPMI Sensor Status on logstash2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:16:42] PROBLEM - IPMI Sensor Status on mc2020 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:18:20] PROBLEM - IPMI Sensor Status on mw2402 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:18:32] PROBLEM - IPMI Sensor Status on mw2404 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:18:44] PROBLEM - IPMI Sensor Status on ores1009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:21:22] PROBLEM - IPMI Sensor Status on mw2411 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:21:22] PROBLEM - IPMI Sensor Status on mw2407 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:21:32] PROBLEM - IPMI Sensor Status on mw2406 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:24:47] (03PS1) 10Filippo Giunchedi: Fix query string quoting in dashboard/runbook URLs [alerts] - 10https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) [15:24:57] (03PS1) 10Filippo Giunchedi: Test for unquoted query strings in runbook/dashboard [alerts] - 10https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) [15:26:32] PROBLEM - IPMI Sensor Status on pc2011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:27:27] (03CR) 10CI reject: [V: 04-1] Test for unquoted query strings in runbook/dashboard [alerts] - 
10https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: 10Filippo Giunchedi) [15:30:11] (03PS2) 10Filippo Giunchedi: Test for unquoted query strings in runbook/dashboard [alerts] - 10https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) [15:30:16] (03CR) 10Muehlenhoff: P:aptrepo: install python3-apt required by reprepro-import-updates-keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813251 (owner: 10Jbond) [15:30:22] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2037.codfw.wmnet with OS bullseye [15:30:33] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye executed... [15:33:02] RECOVERY - IPMI Sensor Status on db2154 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:33:48] RECOVERY - IPMI Sensor Status on wdqs2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:34:16] RECOVERY - IPMI Sensor Status on db2145 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:35:26] RECOVERY - IPMI Sensor Status on db2085 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:35:26] RECOVERY - IPMI Sensor Status on mw2408 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:35:28] RECOVERY - IPMI Sensor Status on mw2409 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:37:32] PROBLEM - Host scandium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:57] (03PS1) 10Jbond: spdx: fix convert role/profile jobs [puppet] - 10https://gerrit.wikimedia.org/r/813256 [15:39:59] (03PS1) 10Jbond: idp: add spdx headers to idp role and profile [puppet] - 10https://gerrit.wikimedia.org/r/813257 [15:40:44] (03CR) 10CI reject: [V: 04-1] spdx: fix convert role/profile jobs [puppet] - 10https://gerrit.wikimedia.org/r/813256 (owner: 10Jbond) [15:41:04] PROBLEM - Host db2104.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:04] PROBLEM - Host db2153.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:04] PROBLEM - Host db2154.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:08] PROBLEM - Host graphite2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:16] PROBLEM - Host wdqs2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:57] (03CR) 10Jbond: "see inline was a stray comment i thought i had removed 😊" [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [15:42:40] PROBLEM - IPMI Sensor Status on mw2405 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:44:12] RECOVERY - Host scandium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.11 ms [15:45:03] PROBLEM - Host graphite2003 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:45:10] PROBLEM - Host maps2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:13] >.> [15:45:24] PROBLEM - Host logstash2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:24] PROBLEM - Host wdqs2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:29] papaul: is ^ you [15:45:36] PROBLEM - Host kubernetes2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:45:40] PROBLEM - Host mw2403 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:52] RhinosF1: yes maintenance [15:45:56] PROBLEM - Host db2121 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:02] PROBLEM - Host doc2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:08] PROBLEM - Host db2132 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:14] PROBLEM - Host pc2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host acmechief2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host puppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host puppetmaster2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host puppetmaster2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host rdb2007 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:16] PROBLEM - Host rdb2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:20] PROBLEM - Host db2145 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:20] PROBLEM - Host parse2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:24] PROBLEM - Host people2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:26] PROBLEM - Host db2079 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:28] PROBLEM - Host db2153 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:28] PROBLEM - Host puppetdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:28] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:29] PROBLEM - Host mw2402 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:30] PROBLEM - Host mw2404 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:34] PROBLEM - Host mw2407 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:34] PROBLEM - Host db2085 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:36] PROBLEM - Host mw2405 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:36] PROBLEM - Host mw2410 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:36] PROBLEM - Host mw2406 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:36] PROBLEM - Host mw2411 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:36] PROBLEM - Host mw2408 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:38] PROBLEM - Host mwdebug2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:38] PROBLEM - Host parse2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:38] PROBLEM - Host poolcounter2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:42] PROBLEM - Host contint2001 is DOWN: PING CRITICAL - 
Packet loss = 100% [15:46:44] PROBLEM - Host db2104 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:44] PROBLEM - Host db2154 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:47] codfw power issue? [15:46:48] RhinosF1: cy1 knock it out [15:46:48] PROBLEM - Host ganeti2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:50] PROBLEM - Host mw2409 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:52] PROBLEM - Host parse2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:58] PROBLEM - Host ganeti2023 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:58] PROBLEM - Host mc2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:06] PROBLEM - Host db2079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:06] PROBLEM - Host db2085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:12] oh oh [15:47:14] PROBLEM - Host elastic2025 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:16] PROBLEM - Host kubernetes2018 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:16] papaul: are we losing all power there, or? [15:47:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:22] they disconnect all in the rack [15:47:24] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:37] bblack: yes CY2 knocked all down [15:47:47] * Emperor here [15:47:52] mistake on their end [15:47:54] should we depool codfw now or will it recover? [15:47:56] here as well [15:48:13] Amir1: it will recover [15:48:16] PROBLEM - Host parse2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:17] they are working on it [15:48:28] PROBLEM - Juniper virtual chassis ports on asw-a-codfw is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:48:43] (03PS1) 10BBlack: Depool codfw front edge traffic [dns] - 10https://gerrit.wikimedia.org/r/813261 [15:48:46] bblack: just rack A5 [15:48:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic1065.eqiad.wmnet with reason: firmware update T312298 [15:48:50] PROBLEM - MariaDB Replica IO: s7 on db2087 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:50] PROBLEM - MariaDB Replica IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2104.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:50] PROBLEM - MariaDB Replica IO: s7 on db2120 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:51] T312298: Update NIC firmware on all Elastic PowerEdge R440 
elastic hosts - https://phabricator.wikimedia.org/T312298 [15:48:53] depooling from edge for now just to limit impact, just in case [15:48:54] PROBLEM - MariaDB Replica IO: s8 on db2098 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:54] PROBLEM - MariaDB Replica IO: s2 on db2107 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2104.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:55] (03CR) 10Ladsgroup: [C: 03+1] Depool codfw front edge traffic [dns] - 10https://gerrit.wikimedia.org/r/813261 (owner: 10BBlack) [15:48:58] (03CR) 10Jbond: [C: 03+1] lists: convert apache template to epp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [15:49:01] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic1065.eqiad.wmnet with reason: firmware update T312298 [15:49:14] elastic, swift will be ok? [15:49:34] PROBLEM - MariaDB Replica IO: m1 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2132.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2132.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:48] RECOVERY - Host parse2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms [15:49:51] (03CR) 10BBlack: [V: 03+2 C: 03+2] Depool codfw front edge traffic [dns] - 10https://gerrit.wikimedia.org/r/813261 (owner: 10BBlack) [15:49:56] PROBLEM - MariaDB Replica IO: s8 on db2086 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:56] PROBLEM - MariaDB Replica IO: s2 on db2125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2104.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:56] PROBLEM - MariaDB Replica IO: s8 on db2084 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:56] PROBLEM - MariaDB Replica IO: s2 on db2097 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 
86400 message: Cant connect to MySQL server on db2104.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:58] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:49:58] (KubernetesCalicoDown) firing: (2) kubernetes2018.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:49:58] probably in there ^ somewhere, but Jenkins 502 [15:50:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:06] PROBLEM - MariaDB Replica IO: s2 on db2138 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2104.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:08] PROBLEM - MariaDB Replica IO: s7 on db2086 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:09] TheresNoTime: jenkins is codfw primary [15:50:10] probably mysql needs manual restart on those hosts [15:50:18] RECOVERY - Host rdb2007 is UP: PING OK - Packet loss = 0%, RTA = 30.03 ms [15:50:18] RECOVERY - Host parse2001 is UP: PING OK - Packet loss = 0%, RTA = 30.10 ms [15:50:18] RECOVERY - Host db2145 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:50:18] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:50:20] RECOVERY - Host kubernetes2018 is UP: PING WARNING - Packet loss = 71%, RTA = 30.08 ms [15:50:20] RECOVERY - Host mw2404 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [15:50:20] RECOVERY - Host db2079 is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [15:50:20] RECOVERY - Host parse2003 is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [15:50:20] RECOVERY - Host mw2409 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [15:50:22] RECOVERY - Host mw2407 is UP: PING OK - Packet loss = 0%, RTA = 30.10 ms [15:50:22] RECOVERY - Host mw2406 is UP: PING OK - Packet loss = 0%, RTA = 30.09 ms [15:50:22] RECOVERY - Host kubernetes2019 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [15:50:22] RECOVERY - Host mw2402 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:50:23] RECOVERY - Host mw2403 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [15:50:24] PROBLEM - MariaDB Replica IO: s8 on db2163 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on 
db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:24] oh hello [15:50:24] RECOVERY - Host db2154 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [15:50:24] RECOVERY - Host ganeti2023 is UP: PING OK - Packet loss = 0%, RTA = 31.34 ms [15:50:26] RECOVERY - Host db2104 is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [15:50:28] RECOVERY - Host mw2408 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:50:28] RECOVERY - Host db2153 is UP: PING OK - Packet loss = 0%, RTA = 30.16 ms [15:50:28] !log codfw dns depooled for front edge traffic [15:50:30] PROBLEM - MariaDB Replica IO: s7 on db2098 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:32] RECOVERY - Juniper virtual chassis ports on asw-a-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:50:32] RECOVERY - Host mw2405 is UP: PING OK - Packet loss = 0%, RTA = 30.12 ms [15:50:32] RECOVERY - Host db2085 is UP: PING OK - Packet loss = 0%, RTA = 30.10 ms [15:50:34] RECOVERY - Host ml-serve2001 is UP: PING WARNING - Packet loss = 66%, RTA = 31.24 ms [15:50:34] RECOVERY - Host ganeti2024 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [15:50:36] RECOVERY - Host parse2002 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:50:36] RECOVERY - Host db2132 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [15:50:42] RECOVERY - Host mw2411 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [15:50:42] RECOVERY - Host mw2410 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:50:42] PROBLEM - MariaDB Replica IO: s8 on db2152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:42] PROBLEM - MariaDB Replica IO: s8 on db2161 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:50:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:50:50] RECOVERY - Host pc2011 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [15:50:51] RECOVERY - Host graphite2003 #page is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [15:50:54] RECOVERY - Host maps2005 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:50:54] RECOVERY - Host logstash2001 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:50:56] RECOVERY - Host elastic2025 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [15:50:56] RECOVERY - Host db2121 is UP: PING OK - 
Packet loss = 0%, RTA = 30.11 ms [15:51:02] PROBLEM - MariaDB Replica IO: s7 on db2150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:51:10] RECOVERY - Host mc2020 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [15:51:12] RECOVERY - Host puppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [15:51:12] RECOVERY - Host wdqs2003 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [15:51:22] RECOVERY - Host contint2001 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [15:51:23] got one page, for one host out of all of these, graphite2003. no others [15:51:32] seems too fast for reboots, must've mostly been switch outages? [15:51:50] yeah I was expecting a lot of pages as well given the severity [15:52:05] hmmm nope, I see some low host uptimes at the OS level [15:52:07] bblack: that uptime for the dbhost I'm checking is one minute [15:52:08] <_joe_> contint2001:~$ uptime [15:52:09] <_joe_> 15:51:57 up 2 min, 1 user, load average: 13.01, 4.99, 1.83 [15:52:10] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:52:13] page for a single host is kind of rare, wonder why for graphite hosts [15:52:14] <_joe_> all rebooted [15:52:19] lots of person hours will be spent just bringing e.g. sql servers back up [15:52:22] papaul reported they lost power (to the switch) [15:52:24] RECOVERY - Host puppetmaster2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms [15:52:24] RECOVERY - Host puppetmaster2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.99 ms [15:52:24] RECOVERY - Host rdb2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.73 ms [15:52:32] hosts lost power too [15:52:34] jynus: I'm on it [15:52:35] mutante: the whole rack [15:52:39] it's ok [15:52:40] papaul: just A5 rack? 
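(To separate hosts that actually power-cycled from ones that only dropped off the network with the switch, a quick uptime sweep works; a rough cumin sketch, with the host list purely illustrative:)
    sudo cumin 'mw24[02-11].codfw.wmnet,db2079.codfw.wmnet,db2104.codfw.wmnet' 'uptime'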
[15:52:44] yes [15:52:46] those won't cause immediate issues [15:53:05] leaving the dns depool in place for now, it will reduce impact while we sort things out [15:53:08] RECOVERY - Host db2104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [15:53:08] RECOVERY - Host db2153.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.74 ms [15:53:08] RECOVERY - Host db2154.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.61 ms [15:53:12] RECOVERY - Host graphite2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms [15:53:12] RECOVERY - Host db2079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.17 ms [15:53:12] RECOVERY - Host db2085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.28 ms [15:53:22] RECOVERY - Host wdqs2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [15:53:40] PROBLEM - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:53:58] one of them is s2 master [15:53:58] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:53:58] PROBLEM - MariaDB Replica SQL: pc1 on pc2011 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:02] sigh [15:54:06] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:54:06] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:54:10] PROBLEM - MariaDB Replica IO: s1 on db2085 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:16] PROBLEM - Juniper alarms on asw-a-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:54:27] netbox view of that rack for ref: https://netbox.wikimedia.org/dcim/racks/47/ [15:54:32] PROBLEM - Check systemd state on mw2409 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:38] PROBLEM - MariaDB Replica SQL: s2 on db2104 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:44] PROBLEM - mysqld processes on db2079 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:54:44] PROBLEM - MariaDB Replica SQL: m1 on db2132 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:46] <_joe_> in all this, we have a thumbor alert [15:54:50] <_joe_> not sure if related [15:54:52] PROBLEM - MariaDB Replica SQL: s8 on db2079 
is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:54:58] PROBLEM - MariaDB Replica SQL: s1 on db2153 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:00] PROBLEM - MariaDB Replica IO: s8 on db2154 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:04] PROBLEM - MariaDB Replica SQL: s1 on db2145 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:06] PROBLEM - Check systemd state on mw2408 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:14] PROBLEM - mysqld processes on pc2011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:55:14] PROBLEM - Check systemd state on mw2406 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:15] PROBLEM - Check systemd state on parse2003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:17] PROBLEM - MariaDB read only s8 #page on db2079 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:20] PROBLEM - MariaDB Replica SQL: s7 on db2121 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:20] ferm? 
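(Per the Master_comes_back_in_read_only runbook linked above, a primary that crashes is expected to come back read-only until it has been checked; a rough recovery sketch for one of these hosts, assuming the stock single-instance mariadb unit and local socket access:)
    sudo systemctl start mariadb                   # bring the crashed mysqld back up
    sudo mysql -e "SELECT @@read_only, @@version"  # expect read_only = 1, matching the RECOVERY lines later in the log
    sudo mysql -e "SHOW SLAVE STATUS\G"            # check its own upstream replication; downstream replicas normally
                                                   # reconnect by themselves (retry-time: 60 in the alerts above)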
[15:55:20] PROBLEM - Check systemd state on pc2011 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:20] PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:22] PROBLEM - MariaDB read only s1 on db2153 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:24] PROBLEM - mysqld processes on db2154 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:55:24] PROBLEM - MariaDB Replica SQL: s8 on db2085 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:24] PROBLEM - Check systemd state on db2104 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:24] PROBLEM - carbon-cache@c service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:55:24] PROBLEM - MariaDB Event Scheduler pc1 on pc2011 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [15:55:25] PROBLEM - mysqld processes on db2121 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:55:28] PROBLEM - Check whether ferm is active by checking the default input chain on parse2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:55:30] PROBLEM - Check systemd state on graphite2003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:30] PROBLEM - Check systemd state on logstash2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,opensearch_1@production-elk7-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:34] PROBLEM - carbon-cache@f service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:55:36] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [15:55:38] and s7 master [15:55:40] PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:40] (03CR) 10Ori: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [15:55:50] PROBLEM - Check systemd state on mw2407 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:50] PROBLEM - carbon-cache@d service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:55:51] (03CR) 10Ori: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [15:55:54] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:56] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:55:58] PROBLEM - Check systemd state on mw2410 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:58] PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:58] PROBLEM - MariaDB read only s1 on db2085 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:55:58] PROBLEM - mysqld processes on db2153 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:55:58] PROBLEM - MariaDB Replica IO: m1 on db2132 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:00] PROBLEM - MariaDB Replica IO: pc1 on pc2011 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:00] PROBLEM - Check systemd state on mw2405 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:04] PROBLEM - Check systemd state on rdb2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:06] PROBLEM - Check systemd state on kubernetes2019 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:08] PROBLEM - Check systemd state on mw2403 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:09] icinga please shut up [15:56:10] PROBLEM - carbon-cache@h service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:56:10] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:10] PROBLEM - Check systemd state on ganeti2023 is CRITICAL: CRITICAL - degraded: The following units failed: 
ferm.service,nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:10] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:56:10] PROBLEM - carbon-cache@b service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:56:12] PROBLEM - MariaDB read only pc1 on pc2011 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:14] I'm on it [15:56:22] PROBLEM - MariaDB Replica IO: s7 on db2118 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2121.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2121.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:24] PROBLEM - MariaDB read only s8 on db2085 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:24] PROBLEM - MariaDB Replica IO: s1 on db2145 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:25] PROBLEM - MariaDB read only m1 #page on db2132 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:56:26] PROBLEM - Check systemd state on mw2404 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:28] RECOVERY - MariaDB Replica IO: s2 on db2125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:28] RECOVERY - MariaDB Replica IO: s2 on db2097 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:34] PROBLEM - carbon-frontend-relay service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:56:44] RECOVERY - MariaDB Replica IO: s2 on db2138 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:45] PROBLEM - Check systemd state on db2145 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:50] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:56:52] PROBLEM - Check systemd state on mw2402 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw2407 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:57:00] PROBLEM - mysqld processes on db2145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:57:00] RECOVERY - MariaDB Replica SQL: s2 on db2104 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:02] PROBLEM - MariaDB Replica IO: s8 on db2085 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:06] PROBLEM - carbon-cache@a service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:57:08] PROBLEM - MariaDB Replica SQL: s1 on db2085 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:10] can we restart ferm on all via cumin? [15:57:16] RECOVERY - MariaDB Replica IO: s7 on db2098 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:20] PROBLEM - carbon-cache@e service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:57:22] PROBLEM - carbon-cache@g service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:57:32] RECOVERY - Check systemd state on mw2408 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:38] RECOVERY - MariaDB Replica IO: s7 on db2087 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:38] RECOVERY - MariaDB Replica IO: s2 on db2088 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:39] s2 and s7 should be back [15:57:40] RECOVERY - MariaDB Replica IO: s7 on db2120 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:42] checking s8 [15:57:44] RECOVERY - MariaDB Replica IO: s2 on db2107 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:44] RECOVERY - MariaDB Replica SQL: s7 on db2121 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:48] RECOVERY - mysqld processes on db2121 is OK: PROCS OK: 1 process with command 
name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:57:48] RECOVERY - MariaDB Replica IO: s7 on db2150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:57:54] RECOVERY - Check systemd state on logstash2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:55] !log mw2405 - restarted ferm [15:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:08] and it was s8 master as well, you got to be kidding me. three masters in one rack [15:58:14] PROBLEM - carbon-frontend-relay metric drops on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [15:58:24] RECOVERY - Check systemd state on mw2405 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2019 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:34] PROBLEM - Check whether ferm is active by checking the default input chain on parse2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:44] RECOVERY - MariaDB Replica IO: s7 on db2118 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:10] PROBLEM - MariaDB Replica SQL: s8 on db2154 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:10] RECOVERY - MariaDB Replica IO: s7 on db2086 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:14] PROBLEM - Host elastic1065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:20] PROBLEM - carbon-local-relay service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:59:32] RECOVERY - mysqld processes on db2079 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:59:34] RECOVERY - MariaDB Replica IO: s8 on db2163 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:36] PROBLEM - Check systemd state on mw2411 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:44] RECOVERY - MariaDB Replica SQL: s8 on db2079 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:52] and s8 is fixed now [15:59:58] RECOVERY - MariaDB Replica IO: s8 on db2152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:58] RECOVERY - MariaDB 
Replica IO: s8 on db2161 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:00:04] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:11] RECOVERY - MariaDB read only s8 #page on db2079 is OK: Version 10.4.25-MariaDB-log, Uptime 103s, read_only: True, event_scheduler: True, 2053.46 QPS, connection latency: 0.004370s, query latency: 0.000369s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:00:12] RECOVERY - MariaDB Replica IO: s8 on db2098 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:00:12] two replicas need restart as well [16:00:45] restarting ferm manually on some codfw mw* hosts [16:00:52] RECOVERY - MariaDB Replica IO: s8 on db2100 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:01:00] RECOVERY - Check systemd state on mw2403 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:18] RECOVERY - Check systemd state on mw2404 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:18] RECOVERY - MariaDB Replica IO: s8 on db2086 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:01:20] RECOVERY - MariaDB Replica IO: s8 on db2084 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:01:20] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.2059 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:01:30] RECOVERY - Juniper alarms on asw-a-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:01:50] PROBLEM - Check systemd state on db2079 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:52] PROBLEM - mysqld processes on db2132 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:02:04] PROBLEM - Check systemd state on kubernetes2018 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:14] PROBLEM - Check whether ferm is active by checking the default input chain on mw2410 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:02:44] RECOVERY - Check systemd state on mw2406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:52] RECOVERY - mysqld processes on db2154 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:03:06] !log mw2401 through mw2410 - performing ferm 
restarts (without cumin, has its own issue) [16:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:28] RECOVERY - Check systemd state on mw2410 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:20] RECOVERY - MariaDB Replica SQL: s8 on db2154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:04:20] PROBLEM - Check whether ferm is active by checking the default input chain on db2079 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:04:26] RECOVERY - Check systemd state on mw2402 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:30] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:30] RECOVERY - Check systemd state on mw2409 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:34] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:44] PROBLEM - Check whether ferm is active by checking the default input chain on db2145 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:05:06] RECOVERY - MariaDB Replica IO: s8 on db2154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:05:08] !log parse200[1-3] - restarted ferm [16:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:20] RECOVERY - Check systemd state on parse2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:32] RECOVERY - Check systemd state on graphite2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:41] (Emergency syslog message) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:05:42] PROBLEM - Check whether ferm is active by checking the default input chain on db2104 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:05:42] RECOVERY - Check systemd state on parse2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:46] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:05:48] RECOVERY - Host elastic1065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [16:06:00] RECOVERY - Check systemd 
state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:00] RECOVERY - MariaDB Replica IO: m1 on db2132 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:06:02] RECOVERY - MariaDB read only s1 on db2085 is OK: Version 10.4.25-MariaDB-log, Uptime 117s, read_only: True, event_scheduler: True, 29.30 QPS, connection latency: 0.003983s, query latency: 0.000532s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:06:24] RECOVERY - MariaDB read only s8 on db2085 is OK: Version 10.4.25-MariaDB-log, Uptime 107s, read_only: True, event_scheduler: True, 21.52 QPS, connection latency: 0.003869s, query latency: 0.000630s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:06:25] RECOVERY - MariaDB read only m1 #page on db2132 is OK: Version 10.4.25-MariaDB-log, Uptime 48s, read_only: True, event_scheduler: True, 1624.07 QPS, connection latency: 0.002881s, query latency: 0.000202s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:06:30] (03PS1) 10Papaul: Add new pdu model for ps1-a5-codfw [puppet] - 10https://gerrit.wikimedia.org/r/813264 (https://phabricator.wikimedia.org/T309957) [16:06:50] (03CR) 10Ori: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [16:06:54] PROBLEM - MariaDB Replica IO: s1 on db2153 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:06:54] PROBLEM - MariaDB read only s1 on db2145 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:07:04] RECOVERY - mysqld processes on db2132 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:07:14] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:07:24] RECOVERY - MariaDB Replica SQL: m1 on db2132 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:07:39] Amir1: db2079 , db2104 seem like they need a systemctl restart ferm [16:07:54] RECOVERY - mysqld processes on pc2011 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:07:57] on it [16:08:02] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:08:06] RECOVERY - MariaDB Event Scheduler pc1 on pc2011 is OK: Version 10.4.25-MariaDB-log, Uptime 97s, read_only: False, event_scheduler: True, 2191.02 QPS, connection latency: 0.004576s, query latency: 0.000462s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:08:10] checked icinga for all ferm-related alerts [16:08:11] (KubernetesCalicoDown) resolved: (2) kubernetes2018.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:08:15] (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:08:38] RECOVERY - MariaDB Replica IO: m1 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:08:40] RECOVERY - MariaDB Replica IO: pc1 on pc2011 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:08:50] RECOVERY - MariaDB read only pc1 on pc2011 is OK: Version 10.4.25-MariaDB-log, Uptime 141s, read_only: False, event_scheduler: True, 2254.39 QPS, connection latency: 0.003218s, query latency: 0.000276s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:08:52] RECOVERY - MariaDB Replica SQL: pc1 on pc2011 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:02] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:10] PROBLEM - Check whether ferm is active by checking the default input chain on pc2011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:09:18] PROBLEM - Check whether ferm is active by checking the default input chain on maps2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:09:35] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) @RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry All your servers are moved, @MoritzMuehlenhoff I am not able to ssh into yours, I am not sure if that is expecte... 
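The stream of "MariaDB Replica IO/SQL" and "MariaDB read only" recoveries around this point corresponds to codfw replicas coming back after the power event. A minimal by-hand check of one such host looks roughly like the following; the plain mysql client invocation is an assumption, WMF db hosts may use a local wrapper or a non-default socket:
    # verify replication threads and lag on a recovered replica, e.g. db2085
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    # a former/primary candidate coming back should still report read_only until explicitly promoted
    sudo mysql -e "SELECT @@read_only, @@event_scheduler"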
[16:09:50] RECOVERY - MariaDB Replica IO: s8 on db2085 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:58] RECOVERY - MariaDB Replica SQL: s1 on db2085 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:36] RECOVERY - MariaDB Replica SQL: s8 on db2085 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1026.eqiad.wmnet [16:11:32] RECOVERY - MariaDB Replica IO: s1 on db2145 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:38] RECOVERY - Host puppetdb2002 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [16:11:40] RECOVERY - MariaDB Replica IO: s1 on db2085 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:44] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [16:11:46] !log repair networking on puppetdb2002 [16:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:50] hnowlan: ^ is the rb1026 depool related to incident? [16:12:00] RECOVERY - MariaDB read only s1 on db2145 is OK: Version 10.4.25-MariaDB-log, Uptime 54s, read_only: True, event_scheduler: True, 1133.96 QPS, connection latency: 0.003289s, query latency: 0.000446s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:12:03] bblack: no, it's being moved [16:12:14] can it wait? 
[16:12:18] RECOVERY - mysqld processes on db2145 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:12:44] RECOVERY - MariaDB Replica SQL: s1 on db2153 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:12:46] bblack: it's already down I'm afraid, didn't see the incident at the time :/ [16:12:51] ok [16:12:52] RECOVERY - MariaDB Replica SQL: s1 on db2145 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:13:00] RECOVERY - IPMI Sensor Status on kubernetes2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:13:08] RECOVERY - Check systemd state on db2104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:08] RECOVERY - MariaDB read only s1 on db2153 is OK: Version 10.4.25-MariaDB-log, Uptime 87s, read_only: True, event_scheduler: True, 1836.55 QPS, connection latency: 0.004236s, query latency: 0.000504s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:13:18] RECOVERY - IPMI Sensor Status on parse2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:13:18] RECOVERY - IPMI Sensor Status on ganeti2024 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:13:18] RECOVERY - IPMI Sensor Status on maps2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:13:32] RECOVERY - Check systemd state on mw2407 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:42] RECOVERY - mysqld processes on db2153 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:13:52] PROBLEM - MariaDB Replica Lag: pc1 on pc2011 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:20] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 58267 bytes in 3.674 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [16:14:30] RECOVERY - Check systemd state on db2145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:32] RECOVERY - MariaDB Replica IO: s1 on db2153 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:34] PROBLEM - MariaDB Replica Lag: s1 on db2153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1003.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:14:37] now all replication is caught up, I'm going restart ferm [16:14:40] RECOVERY - Check systemd state on db2079 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:50] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:56] RECOVERY - Check systemd state on kubernetes2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:56] RECOVERY - carbon-cache@c service on graphite2003 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:56] RECOVERY - carbon-cache@a service on graphite2003 is OK: OK - carbon-cache@a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:56] RECOVERY - carbon-frontend-relay service on graphite2003 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:56] RECOVERY - carbon-cache@d service on graphite2003 is OK: OK - carbon-cache@d is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:56] RECOVERY - carbon-cache@b service on graphite2003 is OK: OK - carbon-cache@b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:58] RECOVERY - carbon-cache@h service on graphite2003 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:58] RECOVERY - carbon-local-relay service on graphite2003 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:58] RECOVERY - carbon-cache@e service on graphite2003 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:58] RECOVERY - carbon-cache@f service on graphite2003 is OK: OK - carbon-cache@f is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:59] RECOVERY - carbon-cache@g service on graphite2003 is OK: OK - carbon-cache@g is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:14:59] PROBLEM - Host restbase1026 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:08] RECOVERY - Check systemd state on mw2411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:20] !log repair networking on people2002 [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:24] RECOVERY - IPMI Sensor Status on mw2403 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:15:32] RECOVERY - Host people2002 is UP: PING OK - Packet loss = 0%, RTA = 37.28 ms [16:15:42] RECOVERY - Check systemd state on pc2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:42] RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:14] RECOVERY - carbon-frontend-relay metric drops on graphite1004 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [16:16:15] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:24] RECOVERY - Check systemd state on rdb2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:26] RECOVERY - Check systemd state on kubernetes2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:33] (03CR) 10Jbond: "see inline but dont consider this blocking. i need to pop out so examples may be a bit rushed and wont be around for a bit but can chart " [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [16:16:42] RECOVERY - IPMI Sensor Status on logstash2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:16:42] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 3484413360 and 1956 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:18:24] RECOVERY - Host acmechief2001 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [16:18:58] (03PS1) 10David Caro: wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267 [16:19:02] RECOVERY - IPMI Sensor Status on mc2020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:19:02] RECOVERY - MariaDB Replica Lag: pc1 on pc2011 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:19:02] RECOVERY - Check systemd state on ganeti2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:16] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:39] !log rebooting mwdebug2001 via ganeti2022 [16:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:42] RECOVERY - MariaDB Replica Lag: s1 on db2153 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:20:44] RECOVERY - IPMI Sensor Status on mw2402 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:21:00] RECOVERY - IPMI Sensor Status on mw2404 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:22:26] (03CR) 10Vivian Rook: [C: 03+1] wmcs: don't page for most checks [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [16:23:12] RECOVERY - Host doc2001 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [16:23:15] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [16:23:42] (Emergency syslog message) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message 
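Several hosts earlier in this window (db2079, db2104, db2145, pc2011, maps2005, ml-serve2001, parse200[1-3]) report "ferm input drop default policy not set" after the power loss, and the checks clear once ferm is restarted, as done above. A rough sketch of the check-and-restart, single host first and then batched; the cumin host selection shown is illustrative only:
    # on an affected host: restart ferm and confirm the default INPUT policy is DROP again
    sudo systemctl restart ferm
    sudo systemctl is-active ferm
    sudo iptables -S INPUT | head -n 1    # expect: -P INPUT DROP
    # batched from a cluster management host (host list here is only an example)
    sudo cumin 'db2079.codfw.wmnet,db2104.codfw.wmnet' 'systemctl restart ferm'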
[16:24:00] RECOVERY - IPMI Sensor Status on mw2407 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:24:10] RECOVERY - IPMI Sensor Status on mw2406 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:24:50] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:25:00] 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [16:25:12] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [16:25:40] (03CR) 10David Caro: [V: 03+1 C: 03+2] novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [16:25:43] (03CR) 10David Caro: [C: 03+2] novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [16:25:51] (03CR) 10David Caro: [V: 03+1 C: 03+2] novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [16:26:58] RECOVERY - Check whether ferm is active by checking the default input chain on parse2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:27:34] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:27:42] RECOVERY - Host restbase1026 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [16:28:26] PROBLEM - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:28:28] RECOVERY - Check whether ferm is active by checking the default input chain on mw2407 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:28:28] PROBLEM - IPMI Sensor Status on wdqs2003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:28:32] RECOVERY - SSH on restbase1025 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:29:02] PROBLEM - cassandra-c service on restbase1025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:29:23] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:26] RECOVERY - IPMI Sensor Status on pc2011 is OK: Sensor Type(s) 
Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:29:30] PROBLEM - cassandra-b service on restbase1025 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:29:42] RECOVERY - Restbase root url on restbase1025 is OK: HTTP OK: HTTP/1.1 200 - 17235 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/RESTBase [16:30:00] PROBLEM - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:30:00] PROBLEM - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:30:01] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:02] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2019 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:30:04] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:08] RECOVERY - Check whether ferm is active by checking the default input chain on parse2003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:31:36] RECOVERY - cassandra-c service on restbase1025 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:31:58] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:58] RECOVERY - cassandra-c CQL 10.64.48.128:9042 on restbase1025 is OK: TCP OK - 0.000 second response time on 10.64.48.128 port 9042 https://phabricator.wikimedia.org/T93886 [16:32:02] RECOVERY - cassandra-b service on restbase1025 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:14] (03CR) 10Papaul: [C: 03+2] Add new pdu model for ps1-a5-codfw [puppet] - 10https://gerrit.wikimedia.org/r/813264 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [16:32:30] RECOVERY - cassandra-c SSL 10.64.48.128:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-c valid until 2023-04-14 11:21:22 +0000 (expires in 275 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:32:30] RECOVERY - cassandra-a SSL 10.64.48.126:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-a valid until 2023-04-14 11:21:17 +0000 (expires in 275 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:32:32] RECOVERY - cassandra-b SSL 10.64.48.127:7001 on restbase1025 is OK: SSL OK - Certificate restbase1025-b valid until 2023-04-14 11:21:19 +0000 (expires in 275 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates 
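restbase1025 comes back with its Cassandra instances (a/b/c) initially inactive and their SSL/CQL checks failing, then recovering once the per-instance units are started. On a multi-instance Cassandra host that is roughly the following; the per-instance nodetool wrapper name is an assumption:
    # start the per-instance units and confirm they are active
    sudo systemctl start cassandra-a cassandra-b cassandra-c
    sudo systemctl status 'cassandra-*' --no-pager
    # check ring membership for one instance (wrapper name assumed)
    sudo nodetool-a status | head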
[16:32:55] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:20] RECOVERY - Check systemd state on poolcounter2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:22] RECOVERY - Host poolcounter2003 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [16:33:26] RECOVERY - cassandra-a CQL 10.64.48.126:9042 on restbase1025 is OK: TCP OK - 0.000 second response time on 10.64.48.126 port 9042 https://phabricator.wikimedia.org/T93886 [16:33:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw2410 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:33:50] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:00] RECOVERY - cassandra-b CQL 10.64.48.127:9042 on restbase1025 is OK: TCP OK - 0.000 second response time on 10.64.48.127 port 9042 https://phabricator.wikimedia.org/T93886 [16:34:18] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:32] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2003 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:34:44] RECOVERY - IPMI Sensor Status on mw2410 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:34:56] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:08] RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.297 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:35:42] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10RhinosF1) Would it better if Service Owners depooled and/or down-timed services before the remainder of these? A2, A4 & A5 have all had power losses during the maintenance (... 
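The restbase1026 depool above (16:10) and the repool a few minutes further down (16:45) around its physical move are conftool actions; done by hand from the cluster management host that is approximately the following, with the selector copied from the !log lines and the flags indicative only:
    # take the host out of rotation before the move
    sudo confctl select 'name=restbase1026.eqiad.wmnet' set/pooled=no
    # put it back once it is up and healthy again
    sudo confctl select 'name=restbase1026.eqiad.wmnet' set/pooled=yes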
[16:36:02] 10SRE, 10ops-eqiad: restbase1025 down - https://phabricator.wikimedia.org/T312805 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Fixed [16:36:39] (03CR) 10Jforrester: [C: 03+1] Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [16:36:51] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2037.codfw.wmnet with OS bullseye [16:37:01] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye [16:37:33] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) @btullis I do have a raid controller, when do you want to schedule this? Tomorrow, Wednesday 1530UTC? [16:37:57] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:38:03] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:38:41] RECOVERY - Check whether ferm is active by checking the default input chain on db2079 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:39:17] RECOVERY - Check whether ferm is active by checking the default input chain on pc2011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:39:23] RECOVERY - Check whether ferm is active by checking the default input chain on maps2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:40:23] RECOVERY - Host mwdebug2001 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [16:41:19] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service,prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:39] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:42:35] RECOVERY - IPMI Sensor Status on kubernetes2018 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:42:56] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10BTullis) @Cmjohnson I can do it now or in the next 30 minutes if that's good for you? Otherwise, yes tomorrow at 15:30 UTC is good too. 
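The KeyholderUnarmed alert on acmechief2001 above (it resolves later, at 18:10) is the usual follow-up of a host reboot: the keyholder agent loses its in-memory keys and has to be re-armed. Per the linked Keyholder page, the remedy is roughly:
    # on acmechief2001: re-arm keyholder (prompts for the key passphrases) and verify
    sudo keyholder arm
    sudo keyholder status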
[16:43:32] RECOVERY - IPMI Sensor Status on elastic2025 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:44:01] RECOVERY - IPMI Sensor Status on puppetmaster2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:44:25] RECOVERY - IPMI Sensor Status on mw2405 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:45:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1026.eqiad.wmnet [16:45:53] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:55] RECOVERY - Check whether ferm is active by checking the default input chain on db2145 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:47:14] (JobUnavailable) resolved: Reduced availability for job poolcounter_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:31] !log doc1002 - systemctl reset-failed [16:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:41] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms [16:49:39] RECOVERY - Check whether ferm is active by checking the default input chain on db2104 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:50:03] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002889 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:50:03] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:03] RECOVERY - IPMI Sensor Status on ores1009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:50:03] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:50:03] RECOVERY - IPMI Sensor Status on mw2411 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:50:04] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:50:04] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2003 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:50:05] 
RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:50:05] RECOVERY - IPMI Sensor Status on wdqs2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:50:07] !log ran failed codfw puppet agents [16:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:51:20] (03PS1) 10BBlack: Revert "Depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/812961 [16:52:37] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:55:03] (03CR) 10BBlack: [C: 03+2] Revert "Depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/812961 (owner: 10BBlack) [16:55:33] !log codfw dns repooled for front edge traffic [16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:54] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2037.codfw.wmnet with reason: host reimage [16:56:21] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:30] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) 05Open→03Resolved All servers are back up, @MoritzMuehlenhoff I had to make the private1 vlan the native vlan [16:58:57] (03PS1) 10David Caro: wmcs: Add novafullstack alerts [alerts] - 10https://gerrit.wikimedia.org/r/813274 [16:59:28] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2037.codfw.wmnet with reason: host reimage [17:00:49] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) db1132 is not getting replication. Is that intentional? 
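Two of the cleanup !log entries above, "doc1002 - systemctl reset-failed" and "ran failed codfw puppet agents", are routine post-outage tidying: clearing units that already recovered but still show as failed, and forcing a puppet run on hosts whose agents failed during the event. By hand that is roughly the following; the cumin alias and puppet wrapper are illustrative:
    # clear the stale failed state once the underlying unit is healthy again
    sudo systemctl reset-failed rsync-doc-doc2001.codfw.wmnet.service
    # re-run puppet across the affected hosts from a cumin host (alias is an example)
    sudo cumin 'A:codfw' 'run-puppet-agent'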
[17:02:15] (03PS1) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813275 [17:02:35] (03PS2) 10David Caro: novafullstack: remove leaked VMs test, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/813275 [17:02:57] (Device rebooted) firing: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:07:57] (Device rebooted) resolved: Device ps1-a5-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:11:48] (03PS1) 10Ladsgroup: labs: Stop writing to the old fields of templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813276 (https://phabricator.wikimedia.org/T312865) [17:13:42] (03CR) 10Ladsgroup: [C: 03+2] labs: Stop writing to the old fields of templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813276 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [17:15:02] (03Merged) 10jenkins-bot: labs: Stop writing to the old fields of templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813276 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [17:16:42] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7418 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:18:22] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2037.codfw.wmnet with OS bullseye [17:18:31] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2037.codfw.wmnet with OS bullseye completed... [17:21:04] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7275 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:22:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:24:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:24:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:26:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:28:34] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:29:52] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [17:30:30] (03PS1) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [17:32:34] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36261/console" [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [17:33:12] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [17:35:36] (03PS2) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [17:38:54] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: 10Filippo Giunchedi) [17:39:55] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [17:41:26] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [17:42:17] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [17:43:32] (03PS3) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [17:43:45] 10SRE, 10MediaWiki-Site-system, 10Patch-For-Review, 10SEO: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402 (10Omidxzzz) I have the same problem [[ https://webdon.ir/product-category/personal-appliance/health-care/category... [17:46:18] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [17:50:57] (03CR) 10Cwhite: [C: 03+1] "Good catch on the parse_qs detail. LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: 10Filippo Giunchedi) [18:04:10] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 13409 MB (37% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:10:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:12:20] RECOVERY - MegaRAID on db1176 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:13:27] (03PS4) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [18:14:29] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [18:16:42] (03PS5) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [18:19:14] (03PS3) 10JHathaway: lists: convert apache template to epp [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) [18:19:16] (03PS4) 10JHathaway: lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) [18:19:52] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [18:21:03] (03CR) 10Jbond: [C: 03+1] lists: convert apache template to epp [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:21:33] (03CR) 10JHathaway: lists: convert apache template to epp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:26:13] (03PS6) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [18:26:31] (03CR) 10Jbond: lists: add apache security configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:31:29] (03PS7) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [18:32:20] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36264/console" [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [18:33:03] (03CR) 10Ori: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813294 (https://phabricator.wikimedia.org/T310644) (owner: 10Ori) [18:35:55] what's the deployment process for beta-cluster-only wmf-config changes these days? [18:36:57] ori: patience [18:37:09] merge it, jerkins will deploy it [18:37:23] some people say pull it onto deploy server and "deploy" for consistency.. 
others just say pull it [18:37:53] thanks (and hi) [18:38:17] (03CR) 10Ori: [C: 03+2] Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [18:44:07] I don't think jenkins-bot will merge it, it got into a bad state with this change earlier due to the outage [18:44:48] (03PS5) 10JHathaway: lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) [18:45:07] can't manually submit either [18:46:42] (03PS8) 10Ottomata: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) [18:46:48] (03PS3) 10Ori: Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) [18:47:01] (03PS4) 10Ori: Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) [18:47:13] (03PS5) 10Ori: Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) [18:47:34] (03CR) 10Ori: [C: 03+2] Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [18:49:36] (03CR) 10JHathaway: lists: add apache security configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:49:42] (03Merged) 10jenkins-bot: Add Beta Wikifunctions to $wmgApprovedContentSecurityPolicyDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813262 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [18:51:17] cmjohnson1: what was the story with T312805? [18:51:17] T312805: restbase1025 down - https://phabricator.wikimedia.org/T312805 [18:52:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:53:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:53:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:54:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:57:20] (03PS1) 10Krinkle: Revert "Enable wgResourceLoaderUseObjectCacheForDeps for all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813296 (https://phabricator.wikimedia.org/T113916) [19:09:01] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Eevans) >>! In T308331#8072918, @Cmjohnson wrote: > @RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry All your servers are moved, > > > @MoritzMuehlenhoff I am not able to s... 
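On the beta-only wmf-config question above: once the change is merged (by jenkins-bot or manually), deployment-prep picks it up on its own, and the optional "pull it onto the deploy server" step just keeps the production staging copy consistent. A sketch of that optional step, with the path and scap invocation inferred from the !log entries elsewhere in this log, so treat the exact commands and filename as indicative:
    # on deploy1002, after the mediawiki-config change has merged
    cd /srv/mediawiki-staging
    git pull
    # optional: sync so production staging matches gerrit even for beta-only settings
    scap sync-file wmf-config/<changed file> 'Beta only: <change summary>'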
[19:12:26] (03CR) 10Krinkle: [C: 03+2] Revert "Enable wgResourceLoaderUseObjectCacheForDeps for all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813296 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [19:13:25] (03Merged) 10jenkins-bot: Revert "Enable wgResourceLoaderUseObjectCacheForDeps for all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813296 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [19:13:27] (03PS3) 10Jbond: P:aptrepo: install python3-apt required by validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/813251 [19:13:48] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for elastic1065.eqiad.wmnet [19:13:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic1065.eqiad.wmnet [19:16:06] (03Abandoned) 10Reedy: Enforce upload ratelimits on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/582046 (https://phabricator.wikimedia.org/T248177) (owner: 10Reedy) [19:19:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:19:53] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on elastic2038.codfw.wmnet with reason: firmware update T312298 [19:19:56] T312298: Update NIC firmware on all Elastic PowerEdge R440 elastic hosts - https://phabricator.wikimedia.org/T312298 [19:20:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic2038.codfw.wmnet with reason: firmware update T312298 [19:20:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:20:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:20:33] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I3071c009c (duration: 03m 09s) [19:21:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:22:08] (03PS4) 10Jbond: P:aptrepo: install python3-apt required by validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/813251 [19:23:11] (03PS5) 10Jbond: P:aptrepo: install python3-apt required by validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/813251 [19:23:22] (03CR) 10Jbond: P:aptrepo: install python3-apt required by validate_cmd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813251 (owner: 10Jbond) [19:26:11] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I3071c009c (2) (duration: 02m 45s) [19:27:33] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2038.codfw.wmnet with OS bullseye [19:27:51] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye [19:30:49] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2038.codfw.wmnet with OS bullseye [19:30:56] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye executed... 
[19:31:13] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2038.codfw.wmnet with OS bullseye [19:31:21] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2038.codfw.wmnet with OS bullseye [19:31:25] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye [19:31:31] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye executed... [19:31:54] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2038.codfw.wmnet with OS bullseye [19:32:27] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye [19:32:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [19:34:37] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2038.codfw.wmnet with OS bullseye [19:34:45] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye executed... [19:35:01] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2038.codfw.wmnet with OS bullseye [19:35:10] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye [19:35:33] 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Papaul) [19:35:52] (03CR) 10Jbond: [C: 03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [19:37:18] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:50] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2038.codfw.wmnet with OS bullseye [19:38:57] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye executed... 
[19:49:53] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2038.codfw.wmnet with OS bullseye [19:50:02] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye [19:57:10] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:59:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:00:04] RoanKattouw, Urbanecm, and cjming: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220712T2000). Please do the needful. [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:07:36] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2038.codfw.wmnet with reason: host reimage [20:11:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2038.codfw.wmnet with reason: host reimage [20:14:20] (03CR) 10Andrea Denisse: [C: 03+1] "Looks good to me!" [alerts] - 10https://gerrit.wikimedia.org/r/813255 (https://phabricator.wikimedia.org/T312817) (owner: 10Filippo Giunchedi) [20:17:42] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/813254 (https://phabricator.wikimedia.org/T312817) (owner: 10Filippo Giunchedi) [20:28:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2038.codfw.wmnet with OS bullseye [20:28:57] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2038.codfw.wmnet with OS bullseye completed... 
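The elastic host reimages threaded through this log (elastic2037, elastic2038, and later elastic2039) are driven by the sre.hosts.reimage cookbook, which handles the downtime, PXE install and first puppet run itself; the earlier exit codes 97/99 on elastic2038 are aborted or failed attempts before the run that completes. Invoked from the cluster management host it looks roughly like this, with the flag spelling indicative rather than exact:
    # reimage an elastic host to bullseye, linking the work to the tracking task
    sudo cookbook sre.hosts.reimage --os bullseye -t T289135 elastic2038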
[20:37:31] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:46:30] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:58:24] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:44] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:44:28] (03PS1) 10Zabe: Undeploy CongressLookup (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813338 (https://phabricator.wikimedia.org/T312894) [21:44:30] (03PS1) 10Zabe: Undeploy CongressLookup (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813339 (https://phabricator.wikimedia.org/T312894) [21:44:32] (03PS1) 10Zabe: Undeploy CongressLookup (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/813340 (https://phabricator.wikimedia.org/T312894) [21:50:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2039.codfw.wmnet with OS bullseye [21:50:41] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2039.codfw.wmnet with OS bullseye [22:11:26] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2039.codfw.wmnet with reason: host reimage [22:15:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2039.codfw.wmnet with reason: host reimage [22:17:47] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@45ae36d]: subgraph_and_query_metrics: Drop wiki from sparql event partition spec [22:19:52] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@45ae36d]: subgraph_and_query_metrics: Drop wiki from sparql event partition spec (duration: 02m 04s) [22:32:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2039.codfw.wmnet with OS bullseye [22:32:25] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2039.codfw.wmnet with OS bullseye completed... [23:22:15] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:50:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown