[05:16:42] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4189150 (10Marostegui) a:03Papaul @Papaul can we get a new disk for this one? Thanks!
[05:43:09] I have done all the steps prior to the actual failover
[05:43:25] I can see
[05:43:49] we can do step 16 in advance, too
[05:45:36] I would prefer to wait, db1069 keeps advancing on its binlog
[05:46:15] yes, but with server_id db1055 events
[05:46:39] which get discarded automatically by the same server_id
[05:46:57] yep, I know, I was just being extra careful :)
[05:56:38] let's move to operations then?
[05:57:37] ok
[06:13:31] 10DBA: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4189202 (10Marostegui) p:05Triage>03Normal
[06:25:53] 10DBA, 10Patch-For-Review: Decommission db1060 - https://phabricator.wikimedia.org/T193732#4189224 (10Marostegui)
[06:40:15] 10DBA, 10Patch-For-Review: Decommission db1060 - https://phabricator.wikimedia.org/T193732#4189237 (10Marostegui)
[06:46:05] 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4189246 (10jcrespo) a:05jcrespo>03RobH
[06:47:02] 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4178332 (10jcrespo) This is ready for dc-ops. Robh you may want to update the template here? Is there a "last version" somewhere?
[06:47:28] 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4189252 (10jcrespo) p:05Normal>03Low
[06:53:58] 10DBA, 10Patch-For-Review: Decommission db1060 - https://phabricator.wikimedia.org/T193732#4189258 (10Marostegui)
[07:05:40] 10DBA, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission db1060 - https://phabricator.wikimedia.org/T193732#4189275 (10Marostegui) a:05Marostegui>03RobH This is ready for @RobH and DC-Ops to take over
[07:08:44] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189282 (10Marostegui) What if we temporarily convert db2092 (s1) to codfw sanitarium, copy db1116's data to db2092. Once the n...
[07:13:31] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189285 (10jcrespo) But one host will not be enough, we need 2.
[07:15:29] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189287 (10Marostegui) >>! In T190704#4189285, @jcrespo wrote: > But one host will not be enough, we need 2. Yes, but for that...
[07:51:30] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db209...
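The "discarded automatically by the same server_id" remark above refers to the replication applier skipping binlog events that carry the replica's own server_id. A minimal sketch of how that could be checked from a mysql client, using only the host names mentioned above; the binlog file name is a placeholder, not a real one:

```sql
-- On db1069: the Server_id column shows which server originally wrote each
-- event still sitting in its binlog (file name below is a placeholder).
SHOW BINLOG EVENTS IN 'db1069-bin.000001' LIMIT 10;

-- On the replica: events whose server_id matches @@server_id are skipped
-- unless replicate_same_server_id is enabled (it is OFF by default).
SELECT @@server_id;
SHOW GLOBAL VARIABLES LIKE 'replicate_same_server_id';
```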
[08:01:23] https://twitter.com/vinchen13/status/988994244999233537
[08:01:53] apparently doesn't work on compressed
[08:02:02] :(
[08:02:19] It would simplify our schema changes a lot if it is only a metadata change
[08:02:26] but if it doesn't work on compressed, that is a big drawback
[08:04:07] 10DBA, 10Operations, 10Patch-For-Review, 10Puppet: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#4189395 (10jcrespo) 05Open>03Resolved a:03jcrespo Done, no maintenance code yet for database maintenance, but that is still on terbiu...
[08:12:11] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189416 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` and were **ALL** successful.
[08:36:38] marostegui: / jynus hi - I see Grename is faster - has anything changed in the process?
[08:37:13] Hauskatze: No idea to be honest, on renames we are just spectators :)
[08:38:17] Hauskatze: I don't think anything has changed yet
[08:38:45] jynus: I think that it might be the redis -> kafka jobqueue migration of the job
[08:38:58] because I see jobs are now run on several wikis at once
[08:39:02] not one by one
[08:39:17] or they're so fast that it appears they're done on several at once
[08:39:18] no idea
[08:39:33] they migrated the localrenamejob to kafka the other day
[08:41:40] yes, but for the upcoming changes it should be instant
[08:41:48] no need for the job queue
[08:42:21] that'd be awesome
[08:42:43] it will open up name changes to any account
[08:42:56] no matter the number of contributions
[08:43:09] would that open the gate for usermerge as well?
[08:43:20] I don't think so
[08:43:34] those would require a job still
[08:43:35] I'm seeing that for some accounts being able to --locally-- merge accounts in some wikis would be great
[08:43:55] the change is that accounts will be identified with a number
[08:44:02] last one was Niharik-a which was forced to 'abandon' an account with ~150 edits
[08:44:06] and the number can be any string
[08:44:26] I guess technically there could be 2 accounts with the same name
[08:44:33] but that would not work well
[08:45:22] so this would not touch that, but that doesn't mean it cannot be proposed
[08:49:03] usermerge support for wmf is proposed at T15658
[08:49:04] T15658: skin.php: selected skin does not exist anymore causes error - https://phabricator.wikimedia.org/T15658
[08:49:07] ehm
[08:49:16] T156584
[08:49:17] T156584: Full UserMerge support for WMF wikis - https://phabricator.wikimedia.org/T156584
[08:58:05] 10DBA: Meta ticket: Deploy InnoDB compression where possible - https://phabricator.wikimedia.org/T150438#4189631 (10Marostegui)
[08:58:09] 10DBA, 10Patch-For-Review: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949#4189629 (10Marostegui) 05Open>03Resolved innodb_strict_mode has been enabled on config everywhere. It will be picked up during restarts. A few hosts were also chan...
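Regarding the "doesn't work on compressed" remark above (instant, metadata-only ALTERs) and the Barracuda/innodb_strict_mode ticket closed at the end of this stretch: a hedged sketch of how one could list which tables use the COMPRESSED row format and would therefore not benefit from a metadata-only change. This is generic information_schema SQL, not anything taken from the production config:

```sql
-- Tables on InnoDB's Barracuda COMPRESSED row format; these are the ones a
-- metadata-only ("instant") ALTER would not cover.
SELECT table_schema, table_name, row_format
FROM information_schema.tables
WHERE engine = 'InnoDB'
  AND row_format = 'Compressed'
ORDER BY table_schema, table_name;

-- innodb_strict_mode (enabled everywhere per the ticket above) makes invalid
-- ROW_FORMAT / KEY_BLOCK_SIZE combinations fail instead of being silently ignored.
SHOW GLOBAL VARIABLES LIKE 'innodb_strict_mode';
```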
[09:48:09] 10DBA: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4189743 (10jcrespo) a:03jcrespo
[10:14:22] 10DBA, 10Patch-For-Review: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4189202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1064.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201805081013_jynus_12330....
[10:19:38] Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db2092 is not replicating?
[10:54:04] Several things: is it okay to increase the size of the ores_classification table on enwiki? Anything I should consider before moving forward? The wbc_entity_usage table on commonswiki is also growing, everything seems fine for now but let me know if anything is getting unhappy. Last but not least, it would be great if you take a look at this patch: https://gerrit.wikimedia.org/r/#/c/430943/
[10:54:14] jynus: marostegui ^
[11:08:23] Also this: https://phabricator.wikimedia.org/T191391#4154763
[11:24:51] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Apply schema changes to an isolated database and examine the results - https://phabricator.wikimedia.org/T191391#4190019 (10jcrespo) The indexes seem disproportionally large compared to the data....
[11:47:09] 10DBA, 10Patch-For-Review: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4190049 (10jcrespo)
[12:45:15] jynus: should I let arzhel know that the blocker of x1's master is done for the switch maintenance?
[12:45:37] sure
[13:09:47] 10DBA, 10MediaWiki-Platform-Team, 10Structured-Data-Commons, 10Wikidata, 10Multi-Content-Revisions (Structured Data Commons): Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#4190292 (10CCicalese_WMF)
[13:16:54] hey folks
[13:17:07] how did the x1 switchover go today?
[13:17:21] anything worthy of note, or uneventful?
[13:17:25] all good
[13:18:00] awesome!
[13:18:01] read only time was more or less: 6:01:30 to 6:05:30
[13:18:02] nice work :)
[13:25:53] marostegui: phabricator search is no longer working
[13:26:20] do you have any hint/tag of how to search the tasks you created for smart disk errors?
[13:26:35] I am searching for db1064, but it gets no results
[13:26:42] but it is phabricator
[13:27:51] :(
[13:28:00] Let me see if I can find them
[13:28:14] e.g. if I search for db1055 it returns no ticket
[13:28:31] but I have been working on db1055 decommission, with that name, all day
[13:28:53] I filed https://phabricator.wikimedia.org/T194154
[13:29:03] oh wow
[13:29:10] true, if I look for it, it doesn't show up
[13:29:19] phabricator's search not being great is not a big deal
[13:29:23] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4190417 (10Marostegui) db2092 is now a temporary multi-instance sanitarium host in codfw, replicating the same sections as db11...
[13:29:27] except if it doesn't find host names
[13:30:38] This is all I can find: https://phabricator.wikimedia.org/T190035
[13:30:49] and then we created related tasks
[13:31:23] maybe db1064 wasn't created?
[13:31:35] I think that is old
[13:31:45] not the latest batch
[13:33:30] No, it wasn't
[13:33:34] It was not created
[13:33:37] I see it has 2 disks failed
[13:33:39] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4190471 (10Marostegui) p:05Triage>03High a:03Cmjohnson @Cmjohnson this host has 2 disks with smart alert. I have manually failed disk #9, let's change that one first, let it rebuild and then we can man...
[13:33:55] Not failed, but with a smart alert
[13:34:35] https://phabricator.wikimedia.org/P7073
[13:34:49] yes
[13:35:03] db1063 is fixed
[13:35:07] and db1073 is on its way
[13:35:42] db1064 was left on its own because it was one of the many s4 slaves, so whenever the disks fail, we'd replace them
[13:36:15] ok, I just wanted to take some of the good disks from the old servers
[13:36:49] yeah +1 to that!
[14:06:23] 10DBA, 10Patch-For-Review: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4190571 (10jcrespo)
[14:06:59] 10DBA, 10Patch-For-Review: Decommission db1055 - https://phabricator.wikimedia.org/T194118#4189202 (10jcrespo) This is ready to be decommed, just in case we will wait a few days before sending it to #dc-ops
[14:55:30] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190864 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete
[14:56:53] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190868 (10Marostegui) Thanks! ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding) ```
[14:59:17] jynus: thanks for the review, by the context I mean the amazing change_tag table: https://phabricator.wikimedia.org/T185355
[14:59:42] by context I mean "how those indexes are supposed to be used"
[15:00:00] and this patch (which will be redundant once mine gets merged: https://gerrit.wikimedia.org/r/#/c/334337/)
[15:00:03] weird indexes for me require weird justifications
[15:00:21] which would be ok if those exist, but in most cases are mistakes
[15:01:02] I see, the first one is for Special:Tags and the ability to sort change tags based on number of use cases
[15:01:14] I can find the exact query but it might take some time
[15:01:16] "on number of use cases" :-)
[15:01:30] you just need one example and I will be happy
[15:01:46] but I mention that in most cases, normally you select by name and join
[15:01:54] and the compound index may be better
[15:02:14] "this column is pretty selective"
[15:02:17] https://en.wikipedia.org/wiki/Special:Tags
[15:02:31] ctd_count is going to give out the last column
[15:03:04] if it doesn't have more than 3-4 values, indexing a column with so few values just doesn't work
[15:03:19] unless it is part of a larger index
[15:03:43] add one select justifying those and I will be happy
[15:03:48] one each
[15:04:12] https://gerrit.wikimedia.org/r/#/c/430943/1/maintenance/archives/patch-change_tag_def.sql
[15:04:19] last comment
[15:06:57] for the first index "SELECT * from change_tag_def ORDER BY ctd_count DESC LIMIT 50;" <- This will return the most used change tags (e.g. visual editor)
[15:07:14] ok
[15:07:22] I was just suggesting to increase it
[15:07:33] in case something else was selected, but it is ok
[15:07:38] if it has no where clause
[15:07:48] what about the other?
[15:08:42] for the second part "SELECT * from change_tag_def WHERE ctd_user_defined = 1 LIMIT 5" <- This returns the user defined tags which are pretty rare (5 tags in 200 for enwiki)
[15:09:17] ok, so you will not use it for ctd_user_defined = 0 ?
[15:09:31] no, it doesn't matter
[15:09:38] because that is the issue, if there are too many results, it doesn't work
[15:09:51] OTOH this table probably won't grow larger than 10K rows (pretty pessimistic estimation)
[15:09:57] ok, then
[15:09:59] so it's a pretty small table
[15:10:05] so that is needed for context
[15:10:10] I did not know that
[15:10:27] in that case high optimization is secondary
[15:11:03] as I said, I had to mention weird things without context
[15:11:20] \o/ It's more like the site_stats table which will get lots of reads and writes but I'm handling that part in the codebase
[15:11:36] by putting things in memcached and flushing it out from time to time
[15:11:52] it is ok
[15:11:59] but the table will stay pretty small and makes the change_tag table a lot smaller too
[15:12:31] jynus: regarding the context, I assumed you knew the situation with the change_tag table fully
[15:12:38] that was my bad, sorry
[15:13:26] no, basically, my life is being added to random tickets I know nothing about
[15:13:44] :))) It's partially my fault
[15:13:52] keep that in mind
[15:14:03] marostegui: db1116:s3 replication failed
[15:14:27] or is it you?
[15:14:41] nope
[15:14:42] checking
[15:15:04] probably not unbreak now?
[15:15:10] no
[15:15:16] as I think it is not in production
[15:15:20] but it is weird: Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-171970637-5477565337, which is not in the master's binlog'
[15:15:45] 171970637-5477565337 is not coming from the master
[15:16:06] It should have been 171970637-171970637
[15:16:47] could it be an alter, or something else?
[15:17:04] there are no ongoing alters on s3 at the moment
[15:18:13] db2092 which was cloned from db1116 didn't have that error and replication is flowing :o
[15:18:20] and it is replicating from the same master
[15:18:45] so, something that only hit db1116?
[15:18:49] I don't think 5477565337 is a server id we can generate
[15:19:04] at least not on eqiad
[15:19:32] maybe on codfw
[15:19:54] did you reset all on cloning?
[15:20:00] on db2092 yes
[15:20:09] oh
[15:20:15] this was not even set up now
[15:20:21] but a few days ago?
[15:20:23] yeah
[15:20:30] today it was only stopped to clone it to db2092
[15:21:06] maybe the issue existed but only showed on restart
[15:21:14] /replication restart
[15:21:40] and is db2092 already ahead of it or still behind?
[15:21:46] already ahead
[15:21:49] and replicating from the same master
[15:21:57] that is very strange
[15:22:01] I mean, I have the cure
[15:22:19] but given we are not in a hurry I would investigate more tomorrow
[15:22:27] yeah
[15:22:29] agreed
[15:22:40] not touch it for now
[15:22:57] but it must be something not coming from its master (db1072)
[15:23:06] because it would have broken db2092 too
[15:23:22] I don't know
[15:23:32] I will ack the alerts
[15:23:55] thanks
[15:24:01] and honestly, I would even rebuild it, given it is only 1 section
[15:24:04] yeah
[15:24:08] I will do so
[15:24:11] rather than risk issues in the future
[15:24:23] I don't know
[15:24:41] I don't care about the issue, but I would like to know why it happened
[15:24:49] oh
[15:25:00] did you import from db1095?
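An aside on the change_tag_def index exchange above, before the replication thread continues below: a hypothetical sketch of single-column indexes that would serve the two quoted queries, under the assumption stated above that the table stays small (roughly 10K rows). The actual definitions live in the gerrit patch linked earlier and may differ:

```sql
-- Serves: SELECT * FROM change_tag_def ORDER BY ctd_count DESC LIMIT 50;
-- (Special:Tags style listing, most used tags first, no WHERE clause.)
CREATE INDEX ctd_count ON change_tag_def (ctd_count);

-- Serves: SELECT * FROM change_tag_def WHERE ctd_user_defined = 1 LIMIT 5;
-- A low-cardinality column is normally a poor index on its own (as noted
-- above), but with ~5 user-defined tags out of ~200 rows the indexed value
-- is rare enough to be useful on a table this small.
CREATE INDEX ctd_user_defined ON change_tag_def (ctd_user_defined);
```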
[15:25:09] yep
[15:25:16] maybe there was something weird about multisource
[15:25:26] that somehow creates issues
[15:25:28] and only shows on restart maybe?
[15:25:32] could be
[15:25:39] it fits with the multisource issues
[15:25:42] yeah
[15:25:44] totally
[15:25:53] maybe we can do a reset slave all
[15:26:03] anyway, I am going, we can have a look tomorrow without haste
[15:26:32] but why did it only happen in s3 and not the rest?
[15:27:27] but yeah, it could be related to multisource
[15:28:05] I did a test on db2092, stopping and starting mysql for s3 and replication is fine
[15:29:01] 10DBA, 10CheckUser, 10Patch-For-Review: Create index for cu_agents in cu_changes table - https://phabricator.wikimedia.org/T147894#4190974 (10Huji) @jcrespo said on gerrit: "The idea seems sane, but I am concerned about the number of indexes this table has- it amounts to half of its total size right now (5 o...
[15:32:39] it is definitely related to multi-source, that 171970637 id is from db1052 (enwiki master) so nothing to do with s3
[15:32:46] 10DBA, 10CheckUser, 10Patch-For-Review: Create index for cu_agents in cu_changes table - https://phabricator.wikimedia.org/T147894#4190981 (10jcrespo) well, what I mean is that for short term this can be deployed as is, but tables with many indexes and trying to apply all kinds of filters normally need more...
[15:35:22] so we should probably disable gtid, reset slave all, do some clean ups on those gtid tables and start it again
[15:38:08] I am going to disable gtid for s3
[15:38:14] on db1116
[16:05:19] 10DBA, 10Operations, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191109 (10jcrespo)
[16:06:58] 10DBA, 10Operations, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191122 (10jcrespo) @Vgutierrez suggested using https://github.com/vstakhov/hpenc , which I don't think is a bad idea at all- it would just change some of the executions of openssl and netcat...
[17:13:27] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Apply schema changes to an isolated database and examine the results - https://phabricator.wikimedia.org/T191391#4191406 (10Ladsgroup) That is very valid. I looked into indexes: ``` test_user@db2...
[17:26:33] 10DBA, 10Operations, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191453 (10jcrespo) The recommended cipher, which is an easier change, is chacha20 or, alternatively, AES-GCM rather than the randomly selected one on the commit.
[18:23:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4191618 (10Marostegui)
[18:24:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4188843 (10Marostegui) The disk has failed to rebuild, can we try another one?: ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed) ``` Thanks!
[18:27:18] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4191631 (10Marostegui) 05Open>03Resolved Disk #9 finished rebuilding: ``` root@db1073:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Device(Encl-32 Slot-9) is not in rebuild process Exit Code: 0x0...
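On the db1116:s3 GTID error above: MariaDB GTIDs are domain-server_id-sequence triplets, so 0-171970637-5477565337 is a position recorded against db1052's server_id (171970637) with a sequence number the s3 master never produced. A hedged sketch of how the leftover multi-source state could be inspected on a MariaDB 10.x replica; these are generic commands, not the exact ones that were run:

```sql
-- Per-domain replication positions; a row pointing at a server_id/sequence
-- that does not belong to the s3 master (db1072) would explain the IO
-- error 1236 quoted above.
SELECT domain_id, sub_id, server_id, seq_no
FROM mysql.gtid_slave_pos
ORDER BY domain_id, sub_id;

-- What the replica would hand to the master when reconnecting with GTID:
SELECT @@gtid_slave_pos, @@gtid_current_pos;
```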
[18:45:52] Tomorrow I will stop replication on db1116, reset slave, clean gtid table, and start replication again with gtid, that should be enough to clean the gtids as they will not be injected again
[18:46:03] If that doesn't work....i will be in the mariadb conference XDDDDD
[18:46:26] But I think that should work
[18:46:44] Interestingly I disabled gtid, let it replicate, then enabled gtid again and it had no issues
[18:46:47] So it is weird
[19:00:38] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194197#4191793 (10Marostegui) p:05Triage>03Normal This disk was manually failed to get it replaced and clear the SMART alert. It has already been swapped by Chris, and it is rebuilding: ``` root@db1073:~# meg...
[19:34:28] 10DBA, 10CheckUser, 10Patch-For-Review: Create index for cu_agents in cu_changes table - https://phabricator.wikimedia.org/T147894#4191898 (10Huji) Understandable. The question is, is the CU tool worth these extra measures? It is used by a very small community of users, sporadically, so a slight latency migh...
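A minimal sketch of the db1116 plan described above (stop replication, reset, clean the gtid table, re-enable GTID), assuming a single MariaDB 10.x replication connection; host, user and binlog coordinates are placeholders, and a real run against a multi-instance host would use the right instance plus the password and coordinates taken from the master:

```sql
-- 1. Stop and fully reset the connection, then clear the stale per-domain
--    positions left over from the earlier multi-source setup.
STOP SLAVE;
RESET SLAVE ALL;
SET GLOBAL gtid_slave_pos = '';        -- empties mysql.gtid_slave_pos

-- 2. Re-point at the s3 master using plain binlog coordinates first
--    (file, position and user below are placeholders).
CHANGE MASTER TO
  MASTER_HOST = 'db1072.eqiad.wmnet',
  MASTER_USER = 'repl',
  MASTER_LOG_FILE = 'db1072-bin.000001',
  MASTER_LOG_POS = 4,
  MASTER_USE_GTID = no;
START SLAVE;

-- 3. Once it is replicating cleanly, switch back to GTID so the position
--    table is repopulated only with events seen from the current master.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;
```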