[06:39:48] 10DBA, 10Operations: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807851 (10Marostegui)
[07:08:23] 10DBA, 10Operations, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807873 (10Marostegui)
[07:10:21] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807875 (10Marostegui) a:05Marostegui>03Cmjohnson This host is now fully ready to be decommissioned.
[07:45:01] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3807901 (10Marostegui) db1096.s6 is now replicating
[07:45:14] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3807902 (10Marostegui)
[08:21:44] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3807925 (10Marostegui)
[09:24:54] do you know what the issue with labsdb1003 was?
[09:26:13] I don't know, I saw Andrew removing files (don't know which ones)
[09:26:22] there was enough space in the LV
[09:26:26] but I guess he didn't notice
[09:26:44] that is all I know, I pinged him an hour after he removed them but he didn't reply, so I guess he was already asleep
[09:31:13] the pcs are at an uncomfortable 74% usage
[09:31:54] I wonder if that is *after* reducing the TTL
[09:35:59] indeed, the TTL is 21 days
[09:44:21] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10MediaWiki-Platform-Team (MWPT-Q1-Jul-Sep-2017): WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3808053 (10jcrespo) @TStarling, @Anomie after 4 months it is unlikely that the rests of a bug are still here- with 21 days (redu...
[10:06:33] so there are these notifications disabled that I cannot remember if I set; I think I did
[10:06:54] but please confirm I am not stepping on your toes: s8 on db2085
[10:06:54] for which host?
[10:07:05] Nope, not touching that host at all
[10:07:06] thanks
[10:07:08] will delete now
[10:07:11] and pool it back
[10:07:12] cool!
[10:07:29] sorry I am being very conservative and bothering you
[10:07:34] too many things to track!
[10:08:25] no no
[10:08:27] not at all
[10:08:30] better be safe than sorry
[10:09:22] I guess we can now relocate db2034 and db2034, but not sure in which ways
[10:09:29] *db2041
[10:09:33] *db2042
[10:10:24] we said for misc, no?
[10:10:32] we have to decommission db2016 too
[10:10:34] yeah, but
[10:10:44] yes, it is complex because of more movements
[10:10:56] some hosts go away
[10:10:59] some get reassigned
[10:11:08] should we maybe focus on decommissioning the old ones
[10:11:12] so we can have a clear picture?
[10:11:16] of how it will look?
[10:11:21] and take it from there?
[10:11:27] yeah, but that will depend on these
[10:11:38] as in, we decom, now we have to replace, etc.
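As a rough illustration of the parsercache usage check discussed above at 09:31, the following is a minimal sketch only: the host name pc1004, the data directory path, and the objectcache-style pc000 table with an exptime column are assumptions for illustration, not taken from this log.
```
# check raw disk usage on a parsercache host (host name and path are assumed)
df -h /srv
# look at the oldest surviving cache entry to gauge whether purging keeps up
# with the reduced 21-day TTL (table/column names are assumed)
mysql -h pc1004 parsercache -e "SELECT MIN(exptime) FROM pc000;"
```
If the oldest entry is much older than the TTL, purging is not keeping up and the 74% figure will keep climbing regardless of the TTL reduction.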
[10:12:07] yeah, that is what I meant: if we check what needs to be decommissioned, we can work out how and with which hosts we have to replace them
[10:12:08] On the other hand, I am not sure I want to focus fully on that
[10:12:11] and what will be left for misc
[10:12:21] while we have not yet figured out eqiad
[10:13:06] For eqiad, we can actually take db1056 already to build it as misc, btw
[10:13:26] db1055 will also become available for misc in the next few days
[10:13:52] maybe we can build db1056 already to replace the faulty master: https://phabricator.wikimedia.org/T166344
[10:14:36] so as you can see, we already have work in eqiad, I would focus on solving that first
[10:14:46] yeah, agreed
[10:14:47] based on immediate needs
[10:15:03] if we happen to perform maintenance on codfw, we can do things there
[10:15:05] for me that master with faulty HW would be a priority
[10:15:11] which one?
[10:15:18] the ticket I pasted above, the m1 master
[10:15:24] ah, yes
[10:15:31] at least to have a host ready
[10:15:34] in case it fails during xmas
[10:15:38] db1052 is not in perfect condition either
[10:15:58] should we enable statement on db1067?
[10:16:01] to have it ready just in case?
[10:16:14] yes, I was thinking of that
[10:16:24] having the vslows of all hosts in statement
[10:16:33] +1 to that
[10:16:44] and the next one in row for replication
[10:16:55] I think there is at least one replica set where that doesn't happen
[10:17:39] yeah, agreed
[10:17:56] not on s3, but actually that is ok there
[10:18:09] as I would pool db1077 before db1072
[10:18:45] s4 I think is the one we should change
[10:19:15] yeah
[10:20:36] so leaving as candidates: db1067, db1053, db1077, db1064, db1051, db1030? db1069, db1063
[10:20:48] buff, we need more decoms there
[10:20:50] still
[10:21:01] not a lot more I think, just 2 servers in s7, no?
[10:21:06] (speaking from memory)
[10:21:38] db1055 and db1056 are or will be available
[10:21:59] yes, db1056 is already available
[10:22:02] db1055 soon too
[10:22:16] we will have at least one big host from the new s5 with only dewiki, as it is overprovisioned
[10:22:27] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3808169 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1098.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017120...
[10:22:33] but we still need to decommission db1030, db1034, db1039
[10:23:08] so db1056 to m1, db1055 to s7?
[10:23:29] actually db1034 may not need a replacement
[10:24:06] I will check the state of the misc hosts later
[10:24:17] I have to finish other stuff first
[10:26:31] yeah, I don't think db1034 needs a replacement
[10:26:45] (for the record, before I forget, I fixed dbstore1002 on Sunday morning - replication broke there again)
[10:26:52] I saw it
[10:26:57] I fixed it on Saturday!
[10:28:00] jesus...
[10:29:36] (twice)
[10:29:45] buf, really?
[10:29:48] what a pain
[10:30:03] stop and reimport page_props
[10:30:14] and another one
[10:30:20] on both wikis
[10:30:29] page_props is the one that failed on Sunday, I think
[10:40:12] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3808257 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1098.eqiad.wmnet'] ``` and were **ALL** successful.
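A minimal sketch of the "enable statement" step discussed at 10:15-10:16, i.e. switching a candidate master to statement-based binlogs so it is ready for promotion. The host name comes from the discussion; the commands are standard MySQL/MariaDB, not WMF-specific tooling, and the exact procedure used in production may differ.
```
# stop replication first: SET GLOBAL binlog_format only affects new sessions,
# so the SQL thread picks up the change on restart
mysql -h db1067.eqiad.wmnet -e "STOP SLAVE;"
mysql -h db1067.eqiad.wmnet -e "SET GLOBAL binlog_format='STATEMENT';"
mysql -h db1067.eqiad.wmnet -e "START SLAVE;"
# confirm the change took effect
mysql -h db1067.eqiad.wmnet -e "SHOW GLOBAL VARIABLES LIKE 'binlog_format';"
```
The same would apply to the other candidates listed at 10:20 (and the setting would also need to be made persistent in the host's my.cnf so it survives a restart).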
[10:41:20] \o/
[10:47:08] there is also x1 that we have to replace
[10:47:29] both servers
[10:47:51] yeah, db1031 has a faulty BBU too
[10:47:56] I thought about db1055 for that one
[10:49:33] we actually need 2
[10:50:49] we can maybe remove db1066 from s1, and give some more api weight to another big host
[10:58:12] I want to try to deploy 394541 before it is too late
[10:59:08] large patches doing refactoring get outdated quickly
[10:59:19] sure
[10:59:39] we can delay the meeting, no problem for me at all
[11:00:09] no, let's do that now
[11:00:19] ok
[13:00:38] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3808752 (10Marostegui) @Anomie s3 is now done - you can proceed with your tests. Keep in mind that labs isn't filtered and the views...
[13:00:50] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3808754 (10Marostegui)
[13:01:00] So we mark as Declined: https://phabricator.wikimedia.org/T180694 ?
[13:05:53] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3808792 (10Marostegui) db1098.s6 is now replicating.
[13:41:12] labsdb1004 replication broken
[13:42:31] it broke around the time labsdb1005 was upgraded
[13:42:57] Ah, I know why
[13:43:05] because labsdb1005 is now running ROW replication
[13:43:07] binlog format
[13:43:15] or was that the case before?
[14:24:03] it should have been the case before, too
[14:24:27] what did it break?
[14:41:44] https://bugs.mysql.com/bug.php?id=61073 interesting
[14:46:57] the database is small, I am going to ignore it and reimport
[14:49:10] something strange happened there, like some problem with ALTER on InnoDB + replication
[14:49:35] hopefully that never happens to us on production, it is scary
[14:50:25] https://bugs.mysql.com/bug.php?id=56226
[14:51:11] it could be something related to old tables + upgrade + alter
[14:51:48] things are being recreated now, though - I will reimport and later recreate the tables on the master to make sure it does not happen again
[14:52:09] luckily, the tables are small
[14:59:44] labsdb1003 lag, is that expected? I suppose yes?
[15:02:08] I would like to set labsdb1004 as read only, and pt-heartbeat on the master
[15:02:15] adding it to the todo
[15:06:55] I think I fixed labsdb1003, there were some queries stuck for 20 hours
[15:28:01] I had a ./reimport_from_labsdb1005.sh on labsdb1004 which sped up the process :-D
[15:28:13] I wonder why I had that :-)
[16:18:45] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3809495 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
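The labsdb1004/labsdb1005 item added to the todo at 15:02 could look roughly like the sketch below; it assumes Percona Toolkit is installed and that a `heartbeat` database is acceptable for the heartbeat table (the database name and flags used here are illustrative, not taken from the log).
```
# make the replica reject accidental writes
mysql -h labsdb1004 -e "SET GLOBAL read_only = 1;"
# write heartbeat rows on the master so replicas can measure real lag
pt-heartbeat -h labsdb1005 -D heartbeat --create-table --update --daemonize
# read the current delay from the replica
pt-heartbeat -h labsdb1004 -D heartbeat --check
```
In practice both read_only and the heartbeat daemon would be managed through configuration management rather than run by hand, so treat this purely as a sketch of the intent.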
[16:21:43] Just read those two bugs… scary, really scary
[16:22:32] everything should be ok now
[16:22:41] I didn't know that script
[16:22:42] taking a look
[16:22:43] I updated our todo with them
[16:22:43] <3
[16:30:00] I will use it to reimport page_props to dbstore1002
[16:30:37] yeah, I will need to bookmark that script
[16:30:38] hehe
[16:31:01] it needs proper fixes, checks and error handling
[16:31:28] also it only works for non-important hosts with inactive tables
[17:16:58] 10DBA, 10MediaWiki-Configuration, 10Operations, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3809840 (10jcrespo) 05Open>03declined We are happy with the configuration on both eqiad and codfw, we do not need to test testwiki...
[18:02:58] marostegui: have to go, but maybe I will fix dbstore later or tomorrow
[18:03:05] I do not want to leave with it broken
[19:05:53] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3810307 (10Anomie) >>! In T174569#3808752, @Marostegui wrote: > @Anomie s3 is now done - you can proceed with your tests. > Keep in...
[20:24:16] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10MediaWiki-Platform-Team (MWPT-Q1-Jul-Sep-2017): WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3810610 (10Anomie) We could still do {T181846} to reduce the cache fragmentation. It shouldn't be a lot of work left to do that.
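For reference, a reimport helper like the ./reimport_from_labsdb1005.sh mentioned at 15:28 and reused at 16:30 might look roughly like the sketch below. The actual script was not shown in this log, so this is only a guess at its shape; as noted at 16:31, such a blunt dump-and-reload is only safe on non-critical hosts with inactive tables.
```
#!/bin/bash
# Hypothetical sketch: copy one table from a healthy host into a broken replica.
# Host names come from the discussion; everything else is assumed.
set -e
SRC=labsdb1005          # healthy source
DST=labsdb1004          # host being repaired
DB="$1"                 # database name
TABLE="$2"              # table name, e.g. page_props
# stop replication so the reload does not race the SQL thread
mysql -h "$DST" -e "STOP SLAVE;"
# dump the table from the source and load it into the destination
# (mysqldump emits DROP TABLE IF EXISTS by default, replacing the broken copy)
mysqldump -h "$SRC" --single-transaction "$DB" "$TABLE" | mysql -h "$DST" "$DB"
mysql -h "$DST" -e "START SLAVE;"
```
On a multi-source host such as dbstore1002 the STOP/START SLAVE statements would need the connection name, and replication would still need to be repositioned past the failed event, which is why the script "needs proper fixes, checks and error handling" as said above.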