[07:04:35] DBA, Operations, ops-eqiad: db1051 disk is about to fail - https://phabricator.wikimedia.org/T149908#2770847 (jcrespo)
[07:19:00] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2770875 (Marostegui) Thanks Papaul, I will reimage it then and watch closely to see if it fails again at some point.
[07:23:55] DBA, Operations, ops-codfw: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2770879 (Marostegui) dbstore2002 caught up. \o/ I am going to do a few tests to make sure it is fine and if so, on Sunday I will take a final snapshot of dbstore2001, move it to dbstore2002 and...
[09:18:08] jynus: I just found out that db2034 (rc service) does not have partitions
[09:18:15] So that is a mistake, I believe
[09:18:30] I will reimage it anyway and then clone it and make sure it has the partitions
[09:18:44] it was just an FYI, maybe you were aware of this issue
[09:23:01] but db2034 crashed
[09:23:26] which shard is it?
[09:23:29] s1
[09:23:51] what about the other rc host, does it have it?
[09:23:58] yes, db2042
[09:24:00] it does
[09:24:00] because maybe it was reimaged
[09:24:17] and that was not taken into account
[09:24:23] yeah, probably
[09:24:25] if it tends to crash frequently
[09:24:48] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2771101 (MoritzMuehlenhoff) a: Cmjohnson
[09:25:32] maybe, I will reimage it and partition it
[09:26:04] or clone it from somewhere with partitions (you suggested cloning it from db1052), we will see
[09:31:33] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2755877 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2034.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag...
[09:59:31] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2771185 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2034.codfw.wmnet'] ``` and were **ALL** successful.
[10:00:00] \o/
[11:20:43] DBA, Operations, ops-codfw: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2771253 (Marostegui) dbstore2002 looks good, stopping and starting slaves, the mysqld process and so forth shows no errors. I have seen that the tokudb plugin cannot be loaded ``` 161104 11:15...
[11:22:25] ^that is a puppet bug
[11:22:32] Oh
[11:22:46] puppet == our puppetization
[11:22:58] but sincerely, I would not bother fixing it at this point
[11:23:09] I was thinking about that when I set up the labsdbs
[11:23:10] no no, that is what I said, that I am not worried about it
[11:23:17] as it works fine on dbstore2001
[11:23:24] it is the huge pages configuration
[11:23:26] I assume it was something related to our versions
[11:23:43] and/or the plugin registry
[11:24:05] No, the huge pages are disabled
[11:24:18] on *2002 I mean
[11:24:20] I checked that
[11:24:23] then toku is not loaded
[11:24:41] no
[11:24:47] that is a different error
[11:25:06] but toku is compiled in
[11:25:27] I think it is related to the fact that we are moving data from 10.0.27 to 10.0.22
[11:25:43] We cannot upgrade dbstore2002 to 10.0.27
[11:25:45] what?
[11:25:56] dbstore2002 is trusty
[11:25:59] why?
[11:26:17] I guess because it was trusty already?
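The TokuDB exchange above comes down to two separate checks: whether transparent huge pages are enabled (TokuDB refuses to initialise while they are) and whether the plugin is actually registered in the running server. A minimal sketch of those checks, assuming a stock MariaDB 10.0 package that ships the plugin as ha_tokudb.so and a `mysql` client that can reach the local socket without extra options:

```
# Transparent huge pages: TokuDB will not start while they are enabled.
cat /sys/kernel/mm/transparent_hugepage/enabled            # want "[never]" selected
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # runtime only; persist it elsewhere

# Check whether the plugin/engine is actually registered in the server.
mysql -e "SHOW PLUGINS" | grep -i tokudb
mysql -e "SHOW ENGINES" | grep -i tokudb

# If the engine is shipped with the package but simply not loaded, it can be
# loaded at runtime instead of via the plugin registry in the config.
mysql -e "INSTALL SONAME 'ha_tokudb'"
```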
[11:26:26] but you reloaded all its data
[11:26:30] why not reimage?
[11:26:41] it is the only moment when we can reimage!
[11:26:41] I didn't think of that :(
[11:26:56] I can take a snapshot and reimage
[11:27:23] why a snapshot?
[11:27:26] For some reason I assumed it was jessie
[11:29:17] I can take a tar.gz, reimage and then put it back
[11:29:22] it shouldn't take long
[11:29:25] you asked me to delete /srv/sqldata
[11:29:39] assume a reimage whenever that happens
[11:29:48] otherwise we cannot do it
[11:30:10] how are we going to upgrade that if not?
[11:30:38] I know, I assumed it was jessie already, so deleted the whole content to copy healthy content from dbstore2001
[11:30:39] and why would we want to move data from 27->22?
[11:31:05] Because 2002 was corrupted
[11:31:12] yes
[11:31:20] Looks like we had a misunderstanding (and I should've checked if it was jessie)
[11:31:21] so we delete it all
[11:32:07] trusties == bad
[11:32:11] :-)
[11:32:17] :(
[11:32:21] I will fix it then
[11:32:29] we want to get rid of them
[11:34:35] I will get rid of that trusty in dbstore2002
[11:35:07] I am not worried, I care about the many hours of your work potentially lost
[11:35:17] no, not really
[11:35:31] copying things is expensive work
[11:35:32] Because the snapshot and transfer will be done in the background
[11:35:47] I was planning to work till late today and probably tomorrow a bit, so no worries :)
[11:35:56] why?
[11:35:59] don't
[11:36:15] you can do that next week, can't you?
[11:36:44] yes, but I want the transfer to finish today
[11:36:51] and if it does I will reimage it tonight :)
[11:36:52] well, it finished
[11:37:16] leave the rest for next week
[11:38:41] wasn't it strange to fit .27 files into .22 (and potentially dangerous)?
[11:39:48] that is the thing, for some reason I assumed both dbstore200X were the same (jessie+10.27)
[11:39:57] ok, ok
[11:40:14] then you just realized that
[11:40:19] yeah
[11:40:23] with the tokudb thing
[11:40:27] both were trusty
[11:40:39] we reimaged dbstore2001
[11:40:44] maybe it was even you
[11:40:55] to jessie
[11:41:20] and the idea was to reimage 2002 while data was on 2001
[11:41:44] so we can eventually be on jessie forever
[11:41:54] I would say we reimage db2002
[11:42:18] then we do a copy 2001 -> 2002, shutting the first one down
[11:42:24] that should be fast?
[11:42:49] yes, that is what I did last time
[11:43:09] that should only take 2-4 hours, right?
[11:43:18] yes, something like 4 or so
[11:43:27] I was even compressing tables in S4 in dbstore2001
[11:43:32] ok, you can do that next week
[11:43:51] no rush during the weekend, right?
[11:43:57] ok ok :)
[11:44:22] imagine the work if we have to reimage to jessie after that!
[11:44:35] we may not even have space available
[11:44:58] once in jessie with the right partitioning
[11:45:05] yes that is true
[11:45:09] we can do non-deleting reimages
[11:45:19] and even in-place updates
[11:45:43] but those old machines have bad partitioning, in most cases
[11:46:01] small /, etc.
[11:46:24] small / outside of the LVM
[11:46:29] I mean
[12:30:35] there are haproxy failover icinga alerts for dbproxy1002 and dbproxy1007, known problem?
[12:30:44] yes
[12:30:47] I cannot ack
[12:31:04] because I need to see it doesn't get worse
[12:31:09] see SAL
[12:31:38] (during the slave reimage, I want to be notified if the master goes down, too)
[12:31:59] ok, thanks
[12:32:01] moritzm, makes sense?
[12:32:11] thanks to you
[12:35:31] makes total sense!
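The 2001 -> 2002 plan discussed above ("then we do a copy 2001 -> 2002, shutting the first one down") is essentially a cold copy of the datadir between the two hosts. A rough sketch under assumptions: /srv/sqldata as the datadir on both ends (the only path taken from the log), a traditional netcat on both hosts, port 9210 picked arbitrarily, and `service mysql` as the init unit name, none of which is confirmed by the log itself:

```
# On dbstore2001 (source): stop MariaDB cleanly so the files on disk are consistent.
service mysql stop

# On dbstore2002 (destination): listen and unpack straight into an empty datadir.
mkdir -p /srv/sqldata
nc -l -p 9210 | tar -C /srv/sqldata -xpf -        # some netcat builds want "nc -l 9210"

# On dbstore2001 (source): stream the whole datadir across.
tar -C /srv/sqldata -cpf - . | nc dbstore2002.codfw.wmnet 9210

# Back on dbstore2002: fix ownership, start MariaDB and let replication catch up.
chown -R mysql:mysql /srv/sqldata
service mysql start
```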
[13:44:15] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2771533 (Marostegui) In the end after discussing it with Jaime and seeing some possible problems we have changed it to: ``` gtid_domain_id = <%= @server_id %> ``` That way the number is uni...
[14:31:43] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2771700 (Marostegui) Jaime and myself were wondering about what would happen if: 1) We have a new slave with a new gtid_domain_id 2) We need to clone that slave 3) We clone that slave from an...
[14:38:56] DBA, Operations, ops-codfw: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2771723 (Marostegui)
[15:39:13] DBA, Operations, ops-codfw, Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2771723 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2002.codfw.wmnet'] ``` The log can be found in `/var/log/w...
[16:02:22] marostegui, checking row D, more or less everything there is just a fraction of the total
[16:02:39] but we may want to pool some additional servers just in case
[16:02:58] jynus: row D? you mean the rack rows?
[16:03:08] Sorry, I was caught off guard :)
[16:03:10] yes, I was looking at what we have there
[16:03:17] let me see
[16:03:18] sorry for the lack of context
[16:03:42] jynus: eqiad or codfw?
[16:03:48] eqiad
[16:05:06] there are not many servers there, no?
[16:05:20] Is it fine to have dbproxy1011 and 1010 in the same rack, by the way?
[16:05:24] They are in D3
[16:05:24] D3: test - ignore - https://phabricator.wikimedia.org/D3
[16:05:35] XD
[16:05:38] marostegui, yes, because the service
[16:05:55] uses dbproxy1001 and 1006,
[16:06:00] 2 and 7
[16:06:03] etc.
[16:06:16] ah right
[16:06:30] so you were thinking about getting more servers in that row?
[16:06:38] no, no
[16:06:47] the maintenance that is going to happen there
[16:06:59] seeing if we have a SPOF
[16:07:17] DBA, Operations, ops-codfw, Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772062 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbstore2002.codfw.wmnet'] ``` and were **ALL** successful.
[16:07:46] I think with the latest organization, we set things up more or less properly
[16:07:52] Aah yes
[16:07:54] Right
[16:08:33] but if we can minimize errors knowing what is going to happen
[16:08:44] maybe we can prepare for that
[16:08:54] yes, if we need to pool in some more servers that is fine
[16:09:09] You can probably speak better from experience to say if we might hit some capacity issues
[16:13:14] DBA, Operations, ops-codfw, Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772079 (Marostegui) ``` root@dbstore2002:/srv/sqldata# lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 8.6 (jessie) Release: 8...
[17:12:52] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2772209 (jcrespo) I can confirm that: I ran `START SLAVE;` after re-imaging db2011, then `STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = slave_pos; START SLAVE`, no issue on the slave. ```...
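The gtid_domain_id change and the confirmation at 17:12:52 boil down to this: puppet renders the domain id from the host's own server_id, and after a re-image or clone the replication connection is switched over to GTID positioning. A small sketch of the per-host check and the switch, reusing the exact statements quoted from T149418 above; the bare `mysql` client invocation (local socket, no extra options) is an assumption:

```
# The puppet template renders "gtid_domain_id = <%= @server_id %>", so the two
# values should simply match on every host; gtid_slave_pos shows the inherited state.
mysql -e "SELECT @@server_id, @@gtid_domain_id, @@gtid_slave_pos"

# The sequence confirmed on db2011: start replication normally, then switch the
# connection to GTID-based positioning.
mysql -e "START SLAVE"
mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = slave_pos; START SLAVE"

# Verify that replication is running and really using GTID.
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Using_Gtid|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```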
[18:44:00] DBA, Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (Dzahn) I agree that we should not have disabled notifications _without_ a comment on them, ideally a reference to a ticket every time. But it's ok to have them if they have a comment AND...
[18:53:47] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2772598 (Marostegui) Awesome!! Thanks for the confirmation. Let's change the master of m2 next week and see what happens
[19:24:30] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2772659 (ksmith) Does this patch being abandoned mean that this issue is no longer fixed?
[20:09:35] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2286195 (AlexMonk-WMF) No, the proper version of the script was replaced in T138450, Ori's DNM version...
[20:27:43] DBA, Operations, ops-codfw, Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772805 (Marostegui) dbstore2002 is now running 10.0.28, mysql_upgrade went fine and tokuDB engine is loaded. The slaves are catching up too. ``` root@dbstore2002:/opt/wmf-mariadb10/bin#...
[20:31:37] DBA, Operations, ops-codfw: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2772815 (Marostegui)
[20:31:39] DBA, Operations, ops-codfw, Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772814 (Marostegui) Open→Resolved
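The closing note on T150017 (dbstore2002 on 10.0.28, mysql_upgrade clean, TokuDB loaded, slaves catching up) corresponds to roughly this verification pass. The binary path is the one visible in the quoted prompt; treating dbstore2002 as a multi-source slave is an assumption based on the plural "slaves" in the ticket comments, and the plain `mysql` invocation is again assumed to reach the local socket:

```
# Run the schema/table upgrade with the packaged client tools.
/opt/wmf-mariadb10/bin/mysql_upgrade

# Confirm the server version and that the TokuDB engine is available.
mysql -e "SELECT VERSION()"                # expect a 10.0.28-MariaDB string
mysql -e "SHOW ENGINES" | grep -i tokudb   # Support column should read YES

# With several replication connections configured, check them all for lag.
mysql -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```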