[08:24:51] labsdb1009.mgmt is down
[08:25:42] yeah, I saw that yesterday and I thought it was related to the net management
[08:26:08] probably my comment should be
[08:26:13] it is still down :-)
[08:26:22] I am adding a comment to the ticket
[08:26:30] not a huge blocker at the moment
[08:26:41] yeah, let's talk to chris
[08:26:46] I am sure it is a loose cable
[08:28:18] I can check if mgmt is up from localhost
[08:29:22] I can query it from localhost, so it is functional
[08:29:32] nice
[08:30:33] times out remotely
[08:31:56] akosiaris: thank you very much for your extra icinga checks!
[08:32:24] they make a huge difference when debugging issues
[08:32:47] (by making them much easier)
[08:35:22] by this time, we should have daily snapshots
[08:37:01] Going to disable puppet on dbproxy1011 to do the haproxy changes to depool dbproxy1010
[08:41:49] +1
[08:42:05] will monitor the hosts for overload/lag
[08:42:07] Just ran the wmcs-wikireplica-dns script, we'll see how long it takes to drain everything
[08:42:26] and I can double check connections from the hosts, if you do it from the proxies
[08:42:38] no worries, I will take care of it :)
[08:43:03] ok, then I will focus on snapshots, of which we should have 2 automatic ones today
[08:43:36] good! :)
[08:47:25] I made a mistake, BTW, the new software doesn't rotate by default, so new backups have not been rotated (even if successful) - working on it
[08:47:40] backup == snapshot?
[08:47:43] or dumps too?
[08:47:49] all :-)
[08:48:07] I changed the software but not the conifg
[08:48:09] *config
[08:48:10] see? we had to deploy so we could catch those things!
[08:48:35] the software was correct, the problem was that the config was not altered accordingly
[08:55:31] jynus: you're welcome ofc, but which checks do you refer to specifically?
[08:57:13] the mgmt host checks
[08:57:27] all of them
[08:57:30] ah, yeah those can be useful
[08:57:37] glad you liked them
[09:17:17] dumps on eqiad started on 2019-03-19 at 17:00:01 and ended at 06:09:49 (13 hours)
[09:18:00] with the new hw we should be able to reduce that to half
[09:20:32] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (Marostegui)
[09:20:43] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui)
[10:12:06] dbproxy1010 is drained, going to upgrade + reboot
[10:12:31] cool
[10:15:33] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (Marostegui)
[10:15:36] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui)
[10:19:28] lots of writes on s7, schema change perhaps?
[10:19:36] on codfw, yep
[10:19:43] I !logged but of course SAL doesn't work :(
[10:19:47] cool, just checking
[10:19:55] yeah, also sometimes it is difficult to keep track
[10:20:01] yeah, it is
[10:20:02] sorry for the ping
[10:20:11] not at all, better be safe
[10:20:24] I am repooling dbproxy1010
[10:22:48] Cool, connections arriving to labs from dbproxy1010 already
[10:32:37] labs labs labs
[10:32:40] :-P
[10:33:19] fair, but try saying WMCS 3 times in a row without your tongue breaking ;-)
[10:33:30] XD
[10:35:24] I will wait until tomorrow to upgrade dbproxy1011, don't want to do both the same day
[10:36:53] marostegui: BTW the SAL does work, it just doesn't get replicated to wikitech: https://tools.wmflabs.org/sal/production
[10:37:11] Ah!
[10:37:16] Thanks :)
[10:37:24] :-)
[10:37:47] I was !logging because it is almost automatic for me anyways haha
[10:39:07] also that error message from the bot makes one think the log is being lost. I always check the SAL in wikitech anyway
[11:02:27] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui) s7 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1125...
[11:02:31] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (Marostegui) s7 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1125 [] db1116 [] db1101 [] db1098 [] db1094 [] db1090 [] db1086 [] db1079 [] db1062
[11:26:10] elukey should check if the db1107 writer application needs to be started again, there are some errors that may be due to that
[11:26:58] jynus: what errors?
[11:27:02] I restarted it yesterday
[11:27:08] EventLogging overall insertion rate from MySQL consumer
[11:27:23] maybe it is something else
[11:27:26] ah yes that one, not sure why it is still alarming, need to check it
[11:27:29] thanks
[11:27:31] :)
[14:17:23] DBA, Operations, ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (Marostegui) p:Triage→Normal a:Papaul Can we get this replaced? Thanks!
[14:18:04] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[15:45:23] If I can get a review, I can push this tomorrow morning: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497793/
[15:46:01] another?
[15:46:13] thanks! :)
[18:44:20] DBA, Performance-Team, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Performance review of Extension:WikimediaEditorTasks - https://phabricator.wikimedia.org/T218087 (kchapman)
[19:01:32] DBA, Performance-Team, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Performance review of Extension:WikimediaEditorTasks - https://phabricator.wikimedia.org/T218087 (aaron) The wetede_rand index seems like it will need to be changed as I mentioned, given the size. Othe...
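
On the drain check discussed above (depooling dbproxy1010 at 08:37 and declaring it drained at 10:12): HAProxy's admin socket accepts a "show stat" command that returns per-proxy counters as CSV, including scur (current sessions), so a small script can poll it until a given backend server reports zero sessions. Below is a minimal sketch of that idea, assuming an illustrative socket path, backend name, and server name - these are not the actual production dbproxy1010 configuration.

#!/usr/bin/env python3
"""Minimal sketch: poll HAProxy's admin socket until a backend server has no
current sessions, i.e. the proxy has drained. The socket path, backend name
and server name below are assumptions for illustration only."""

import csv
import io
import socket
import time

HAPROXY_SOCKET = "/run/haproxy/admin.sock"  # assumed admin socket path
BACKEND = "mariadb"                         # assumed backend (pxname)
SERVER = "labsdb1010"                       # assumed server (svname)


def current_sessions(sock_path: str, backend: str, server: str) -> int:
    """Return the scur (current sessions) counter for one backend server."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(b"show stat\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    raw = b"".join(chunks).decode()
    # Output is CSV; the header line starts with "# pxname,svname,..."
    reader = csv.DictReader(io.StringIO(raw.lstrip("# ")))
    for row in reader:
        if row.get("pxname") == backend and row.get("svname") == server:
            return int(row["scur"])
    raise LookupError(f"{backend}/{server} not found in 'show stat' output")


if __name__ == "__main__":
    while True:
        scur = current_sessions(HAPROXY_SOCKET, BACKEND, SERVER)
        print(f"{SERVER}: {scur} current sessions")
        if scur == 0:
            print("drained")
            break
        time.sleep(30)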