[05:27:08] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) >>! In T204006#4662552, @kaldari wrote: > @Marostegui - I've heard from multiple people about unexpected fires delaying Ops/DBA w...
[05:28:32] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) 05Open>03Resolved a:05Banyek>03Papaul This is no longer about dbstore2002 but about db2042, so let's follow on that task: T202051 dbstore2002 is good for now, so let's close this and re-open if nec...
[05:32:19] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) >>! In T205865#4657924, @hoo wrote: > ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM in...
[05:36:51] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Marostegui) >>! In T206592#4658582, @Bawolff wrote: > Im away th...
[05:38:58] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) @Banyek please double check the key purge has finished on mwmaint1002 and keep on with the rest of pending things to do here. Probably...
[06:35:53] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) @hoo after db1109 has been recloned (and now has compressed tables): ``` root@db1109.eqia...
[06:36:08] addshore: ^
[06:51:57] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) As some of the pooled hosts have already been reimaged we can already see the production dis...
[06:51:59] marostegui: that's awesome
[06:52:20] it looks like the reimaged hosts are indeed as fast as the codfw servers for the dispatch queries
[06:52:31] so it could be the compression!
[06:52:40] that's amazing
[06:53:12] not really! but at least we found something!
[06:53:18] yup
[06:53:34] it's more amazing to me the difference it makes
[06:53:50] yeah
[06:53:51] going from 2.5 minutes of average lag for the dispatching process to just about 10 seconds
[06:53:53] which is the first time I see that
[06:54:02] Like, such a big difference
[06:54:05] yup
[06:54:09] I also wouldn't discard the bug I posted
[06:54:20] the bug?
[06:54:25] yeah, the mariadb bug
[06:54:52] oh *reads the comment above*
[06:54:54] with the optimizer
[06:55:33] aaah, I see
[06:56:07] We believe we hit that bug at https://phabricator.wikimedia.org/T197486#4293781 already
[07:15:27] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) Another test with db1104 before and after compressing: ``` root@db1104.eqiad.wmnet[wikida...
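The compression discussed above is InnoDB's compressed row format. A minimal sketch of how a table is converted and how the remaining uncompressed tables can be listed; the host, table name and block size here are only illustrative, not the exact commands that were run:

```bash
# convert one wikidatawiki table to the compressed InnoDB row format
# (8K is a common KEY_BLOCK_SIZE choice; older MariaDB versions also need
# innodb_file_per_table=1 and innodb_file_format=Barracuda)
mysql -h db1109.eqiad.wmnet wikidatawiki \
  -e "ALTER TABLE wb_changes ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"

# list tables that are still not compressed
mysql -h db1109.eqiad.wmnet information_schema \
  -e "SELECT table_name, row_format FROM tables
      WHERE table_schema = 'wikidatawiki' AND row_format <> 'Compressed';"
```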
[07:15:31] addshore: ^
[07:16:37] ::D
[07:16:41] this makes me so happy
[07:30:21] I need to manually downtime db2096 until the reimaging happens (I won't merge the patch to productionize it until the /srv partition is ready)
[07:30:28] marostegui: can I proceed?
[07:30:38] banyek: sure, give it like 2 days of downtime or something
[07:30:57] 👍
[07:31:43] then I'll proceed to reimage
[07:32:11] have you run puppet on the install hosts?
[07:32:14] so they get the change?
[07:33:00] I haven't merged yet
[07:33:12] but I planned to do it after the merge
[07:33:16] once the reimage process goes, I would suggest you switch to the parsercache task (check my last update)
[07:33:23] ok
[07:33:38] so you don't waste time just waiting for a server to install :)
[07:45:09] I'll start the reimaging now
[07:45:16] ok
[07:49:20] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by banyek on neodymium.eqiad.wmnet for hosts: ``` ['db2096.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20181015...
[07:51:32] addshore: https://phabricator.wikimedia.org/T205865#4665200 I love your drawing skills hahahahhaa
[07:51:35] <3
[07:51:45] :D
[07:52:01] * marostegui will never show his
[07:52:20] I always wanted to get a tiny little graphics tablet to improve my on-screenshot doodling :D
[08:00:29] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) on checking the screen '91243.parsercache' on mwmaint1002 I can confirm that the key purge has finished - I proceed
[08:01:31] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) a:03Banyek
[08:01:43] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[08:02:10] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Sounds good! Also as T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the imm...
[08:06:27] 10DBA, 10User-Banyek: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek)
[08:07:59] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[08:08:37] First I set back the binlog sizes to normal
[08:08:46] cool
[08:08:51] log it ;)
[08:09:21] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2096.codfw.wmnet'] ``` and were **ALL** successful.
[08:09:24] and then flush logs - before stopping the cleaner
[08:09:29] ^ yay that
[08:09:40] nice
[08:10:23] the cloning itself shouldn't take long as it is not a big dataset
[08:10:51] sec.
[08:17:09] back
[08:19:48] so addshore's maintenance run broke any further possible recovery, right?
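The "set the binlog sizes back to normal, then flush logs before stopping the cleaner" step above would look roughly like this on a parsercache host; the host name and value are taken from elsewhere in the log, and this is a sketch rather than the exact commands run:

```bash
# restore max_binlog_size to its normal value (1048576000 per the later note)
mysql -h pc1004.eqiad.wmnet -e "SET GLOBAL max_binlog_size = 1048576000;"

# rotate to a new binlog so the small files produced during the purge become
# eligible for cleanup, then list what is left on disk
mysql -h pc1004.eqiad.wmnet -e "FLUSH BINARY LOGS; SHOW BINARY LOGS;"
```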
https://phabricator.wikimedia.org/T206743#4662028
[08:21:03] yeah, I believe we can now only rebuild the hosts, which I am currently doing https://phabricator.wikimedia.org/T206743#4658362 (db1104 is almost done)
[08:21:27] for the sanitarium master this is what I thought we could do
[08:21:42] the ~200 articles in the comment will need some manual recovery or a maint script to merge the revisions together if we decide we want to do that
[08:21:53] or, as the list is only ~200 we might just leave it to the community
[08:21:53] - remove wikidata views on labs server (send an email to announce it), use mydumper+myloader on db1087 and let it replicate
[08:21:59] - check private data on labs once it is finished
[08:22:02] - create the views again
[08:22:07] I don't want to risk any possible leakage
[08:29:27] after reimaging db2096 we now have the /srv partition:
[08:29:29] ```/dev/mapper/tank-data 3.6T 3.7G 3.6T 1% /srv```
[08:29:36] great
[08:29:46] I'll start the recloning before continuing with the PC* hosts
[08:29:54] (it will take time)
[08:29:58] sounds good
[08:30:17] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) @Pigsonthewing Please don't get worried - as I said at T206743#465...
[08:32:12] is there any reason not to depool/stop and use db2033 as a donor for the cloning? after I finish the puppet run parts?
[08:32:57] banyek: I would use db2069 because it is the candidate master and because db2033 has a broken BBU (or had a broken BBU)
[08:33:35] 2069 then
[08:33:38] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Addshore) I wrote a little crappy but effective query that I can just ru...
[08:33:39] you could even use the backups if you wanted to, up to you
[08:33:46] but before that first time puppet run!
[08:33:51] (and merge)
[08:34:04] marostegui: ^^ I ran my query against all of the shards
[08:34:32] addshore: thanks a lot
[08:35:04] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui)
[08:39:27] marostegui: with the s8 redeployment, could it be that some schema changes are undone?
[08:39:42] jynus: nope, all the ones were applied already to s8
[08:39:46] s8 codfw I mean
[08:40:09] actually, we might have "done" some in eqiad
[08:40:10] haha
[08:40:14] let me check
[08:41:04] The only one I could think of was already done in codfw and eqiad
[08:41:07] for s8
[08:42:55] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) Now that I see it we are even more than 50% done, as we have also done lots of sections in both DCs already, so only pending 4 se...
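The mydumper+myloader step proposed at 08:21:53 above could look roughly like this; the hosts, paths, credentials and thread counts are assumptions for illustration, not the commands that were actually used:

```bash
# dump wikidatawiki from a recloned, depooled host (credentials omitted)
mydumper -h db1104.eqiad.wmnet -B wikidatawiki \
         -o /srv/tmp/wikidatawiki-dump -t 8 -c   # -c compresses the dump files

# load it into the sanitarium master and let replication catch up afterwards
myloader -h db1087.eqiad.wmnet -B wikidatawiki \
         -d /srv/tmp/wikidatawiki-dump -t 8
```

Once the import has caught up through replication, private data would be re-checked on labs and the views recreated, as listed in the plan above.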
[08:51:18] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) @Addshore The main task now is to check pages and pages from re...
[08:53:02] jynus: is the reimage of the slaves that serve mw traffic all done?
[08:53:12] addshore: no, that needs to happen first
[08:53:15] ack
[08:53:43] addshore: I hope to only have db1087 (vslow) left to be done by the end of today
[08:53:50] But I need to coordinate with the cloud team
[08:53:57] I will send an email once that is the only host pending
[08:54:14] and then we need to start thinking about the master failover
[08:54:34] we can do the master and db1087 manually
[08:54:57] ack
[08:55:07] jynus: db1087 for all the tables?
[08:55:26] not a full import, just a diff
[08:55:30] we already have the query (in one of the comments from Friday) to list all of the pages that will require some intervention as they have been edited since
[08:55:43] addshore: how large is it?
[08:55:52] jynus: but for all tables?
[08:55:57] yes
[08:56:07] isn't that a lot more work than mydumper+myloader?
[08:56:16] jynus: it is https://phabricator.wikimedia.org/T206743#4661906 so on Friday it was 248 pages
[08:57:03] 509 now
[08:57:07] I guess MCR will fix that
[08:57:19] as it will only add sections?
[08:57:25] and not edit the whole page?
[08:57:34] lots of the wikitext pages will not need fixing as they are only report pages that are generated every day or week anyway
[08:58:03] I am mostly concerned about mainspace pages
[08:58:15] It should also be pretty easy for the community to write a bot to fix the other pages if we identify them, as the diffs can just be re-applied onto the current version of the page
[08:58:49] do you think that list will grow after your edit?
[08:59:00] after my edit?
[08:59:15] https://phabricator.wikimedia.org/T206743#4662028
[08:59:23] well, it will always grow
[08:59:49] yes, it will always grow, but now users won't be managing to do unexpected things
[08:59:50] but maybe we can filter the ones that did not delete any content automatically
[09:00:33] well, the joy of this essentially being edits to a json structure in the main namespace means it should be pretty easily recoverable if the wikidata community decide they want to
[09:00:55] edit conflicts are far easier to solve there than with wikitext or something else unstructured
[09:01:08] do they have access to the underlying json?
[09:01:25] does the API provide that?
[09:01:46] oh, actually the api doesn't provide the old revisions >.>
[09:01:59] but we could make it
[09:03:05] ok, let me talk to manuel and see how I can finish the reimports
[09:03:26] ack!
[09:04:36] (messages suppressed on db2069)
[09:05:26] marostegui: so what is the status, should I start doing a diff of the master and the rest of the hosts?
[09:07:57] jynus: so right now pending hosts to reclone are db1092 and db1087 (sanitarium master) and all the sanitarium and labs
[09:08:09] db1092 I will reclone in a bit
[09:08:18] obviously the master is pending too
[09:08:27] for db1087 I thought about doing what I said earlier
[09:09:17] that won't work
[09:09:24] why?
[09:09:28] it will work
[09:09:40] but people will not be happy
[09:09:58] and will break tools
[09:10:00] etc
[09:10:01] yeah, but can we leave db1087 exactly as the rest of the hosts without a full import?
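The "query to list all of the pages that will require some intervention" mentioned at 08:55:30 is not reproduced in this log; a hypothetical version of that kind of check (the cutoff timestamp, host and exact filtering are assumptions, not the query from the task) would be:

```bash
# list pages with at least one revision saved after an assumed cutoff,
# i.e. pages edited on top of the rows that went missing
mysql -h db1104.eqiad.wmnet wikidatawiki -e "
  SELECT DISTINCT page_namespace, page_title
  FROM revision
  JOIN page ON rev_page = page_id
  WHERE rev_timestamp > '20180912000000';"
```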
[09:10:09] that is my proposal
[09:10:24] and mydumper of wikidata will be impossible on labsdb
[09:10:31] it will take 1 week
[09:10:35] no, it will be coming through replication
[09:10:37] and we will have metadata issues
[09:10:46] if we remove the views, no
[09:11:03] well, first it has the load of 7 other sections
[09:11:03] Don't get me wrong, I am in for a fix without a full import, I just don't know if we can do it :)
[09:11:13] jynus: why all 7 sections?
[09:11:15] I am asking to let me try
[09:11:31] as it will be some work
[09:11:43] but it will save a lot of pain
[09:11:59] let me depool one "old" and "new" host, do a diff offline
[09:12:05] we just need to load wikidatawiki, not the other sections, as it has its own replication thread
[09:12:33] jynus: sure, let me fully repool db1104 first
[09:12:35] doing it now
[09:12:45] we are talking about our largest wiki on one of our busiest servers
[09:12:59] when we can just copy it from codfw
[09:13:08] from sanitarium codfw?
[09:13:12] yes
[09:13:29] with a diff?
[09:13:38] no, physically
[09:13:44] no, we cannot do that
[09:13:49] it is multisource
[09:13:50] why?
[09:13:54] and?
[09:14:00] how did you set it up in the first place?
[09:14:14] so you trust the portable tablespaces? :)
[09:14:44] I prefer that to breaking labsdbs for 2 weeks
[09:14:55] sure, but that can also break the whole host
[09:15:05] With the other approach we just "break" wikidata
[09:15:20] let me do the diff
[09:15:23] +1!
[09:15:29] that should work and will take less time and fix the master
[09:17:09] moritz told me not to use neodymium, so I tried to start the clone from cumin2001
[09:17:20] but I got this:
[09:17:24] https://www.irccloud.com/pastebin/X6LvDIii/
[09:17:57] your wmfmariadbpy module is not on your path
[09:18:23] because it is not installed on the system
[09:18:45] try changing the import so it searches on the right path or change the python path, etc.
[09:19:11] yep, I just wanted to give a heads-up as it probably won't only affect me
[09:19:38] marostegui: am I correct in thinking that all of the slaves that the dispatching is selecting from now have compressed tables? relating to https://phabricator.wikimedia.org/T205865
[09:20:12] I'm now seeing the same amazing dispatching experience as when we were running in codfw :)
[09:20:29] I'm wondering if there is some process we need to put in place to make sure the tables are always compressed?
[09:21:22] innodb options on mw config
[09:21:37] *looks*
[09:21:48] addshore: only db1087 (vslow) and db1092 have not been recloned
[09:21:51] (and the master)
[09:21:57] /*$wgDBTableOptions*/
[09:22:19] or, you know, for wikibase, it can be done in its own way
[09:22:29] jynus: Can I take db1092 down?
[09:22:41] what do you want to do?
[09:22:45] reclone it
[09:23:05] can we just depool it and use it to do the diff?
[09:23:10] sure
[09:23:22] or db1087
[09:23:25] whatever you prefer
[09:23:42] well, db1087 we should keep, to let it replicate?
[09:23:42] so pending to reclone: db1071, db1087 (and all its slaves) and db1092
[09:23:54] yep
[09:24:09] can we live without 87 and 92?
[09:24:21] or should we change replication to replicate from codfw?
[09:24:27] for replicas
[09:24:28] without both of them?
[09:24:33] yes
[09:24:37] I believe so
[09:24:39] mediawiki-wise
[09:24:40] Let me prepare a patch
[09:24:46] and we can see it there
[09:25:03] we can maybe stop db1087 and codfw in sync
[09:25:11] and migrate replication and then clone it?
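"Stopping db1087 and codfw in sync" can be done with GTID replication by stopping one host, reading its position, and letting the other run up to exactly that point. A minimal sketch, assuming MariaDB GTID replication, a codfw host that is still behind, and a made-up GTID value:

```bash
# stop the eqiad sanitarium master and note where it stopped
mysql -h db1087.eqiad.wmnet -e "STOP SLAVE; SELECT @@gtid_slave_pos;"

# let the codfw counterpart replicate up to that exact position and stop
# there, so both hosts end up holding the same data
CODFW_HOST=dbXXXX.codfw.wmnet   # placeholder host name
mysql -h "$CODFW_HOST" \
  -e "START SLAVE UNTIL master_gtid_pos = '0-12345-678901';"
```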
[09:25:27] I don't know
[09:25:44] stopping db1087 and codfw in sync is a good idea
[09:25:53] check the patch
[09:26:16] this one? https://gerrit.wikimedia.org/r/465634
[09:26:26] no, it is coming
[09:27:08] come on git...
[09:27:31] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467295/1/wmf-config/db-eqiad.php
[09:29:29] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek) db2096 is getting recloned from db2069
[09:37:42] we should leave 2 hosts with api
[09:37:51] it is wikidata, it has lots of bot activity
[09:38:24] maybe putting db1109 as low api
[09:39:23] then we reclone db1092, and use it to compare it to db1087?
[09:39:35] I set back the max_binlog_size to 1048576000 on parsercache hosts
[09:40:15] jynus: should I give some api to db1109 then?
[09:40:29] maybe db1104 "2" and db1109 "1" for api
[09:40:52] banyek: log it
[09:41:14] yeah, it is not so much whether a single host can handle that
[09:41:21] but it suffers from connection overhead
[09:42:02] so I learned some time ago that even if the load is low, the connections are high and we should have 2 api hosts everywhere
[09:42:16] check the patch again \o/
[09:42:22] I'll leave the log cleaners running for an hour to clean up the small files
[09:42:48] banyek: cool, also update the task description as I believe one of the ticks to mark was the binlog sizes restoration (don't remember for sure)
[09:43:04] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[09:43:13] <3
[09:43:18] I can't be fast enough <3 :)
[09:43:42] it was a reminder (I sometimes forget!)
[09:43:48] marostegui: just check load after deployment
[09:43:51] yep
[09:44:02] jynus: won't reclone db1092 until you are done with it
[09:44:09] and we can do some followups to make connections similar across hosts
[09:44:27] marostegui: actually I was going to ask you to do it (or I do it) ASAP
[09:44:37] ah
[09:44:37] because comparing 2 eqiad hosts will be faster
[09:44:43] I thought you wanted to use it first
[09:44:52] Sure, I can start with it (and you adjust weights if needed?)
[09:44:55] I am definitely going to use db1087
[09:45:04] already
[09:45:21] but I need something to stop properly
[09:45:27] and an eqiad host will be faster
[09:45:29] Great
[09:45:33] So I will reclone db1092 now
[09:45:49] I will adjust weights
[09:45:55] thanks :*
[09:45:56] checking performance and open connections
[09:45:57] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) I made a sub ticket for adding this to Wikibase itself for 3rd party users.
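"Checking performance and open connections" (09:45:56) across the API replicas can be done with a quick loop like the one below; the host list is just an example and not necessarily what was checked:

```bash
# compare open connection counts on the two API replicas after the weight change
for HOST in db1104.eqiad.wmnet db1109.eqiad.wmnet; do
    echo -n "$HOST: "
    mysql -h "$HOST" -BN -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
done
```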
[09:46:17] hosts depooled
[09:52:40] the cache cleanup has finished on the parsercache hosts, however it seems the binlogs are still being written really fast
[09:52:50] db1109 is ok, although loaded
[09:53:03] will continue checking as traffic increases
[09:53:13] banyek: that is kinda expected
[09:53:32] banyek: consider doing an optimize on codfw
[09:53:49] at least to see if that helps with space, and how much
[09:53:59] jynus: https://phabricator.wikimedia.org/T206740#4665109
[09:54:00] and we may do the same thing when the new hosts arrive
[09:54:05] * marostegui hugs jynus
[09:54:15] :-)
[09:54:23] yes, that's the next on my list :)
[09:54:35] however I was thinking about keeping the purgers running
[09:55:01] (I'll check the graphs, etc. before coming up with numbers)
[09:55:30] banyek: the normal state is not to have them, so I think we should stop and see if we for some reason have an additional writing problem :)
[09:55:38] maybe stop them only on codfw as it is not active
[09:55:53] but you own that task
[09:55:59] so up to you :)
[09:56:26] * banyek grins "happily"
[09:56:32] yay
[09:56:37] ;)
[10:52:04] I'm going to deploy a change on s7 that improves performance, but if you see any anomalies, let me know
[10:57:59] The recloning of db2096 from db2069 is finished, so I'll make the two hosts replicate again
[11:24:27] 10DBA, 10User-Banyek: /run/mysqld with correct rights should be exitst - https://phabricator.wikimedia.org/T207013 (10Banyek)
[11:26:26] 10DBA, 10User-Banyek: /run/mysqld with correct rights should be exitst - https://phabricator.wikimedia.org/T207013 (10jcrespo) > because there was no /run/mysqld directory This is because the server was not rebooted- /run/mysqld is created automatically by tmpfile config on start.
[11:27:08] 10DBA, 10User-Banyek: /run/mysqld with correct rights should be exitst - https://phabricator.wikimedia.org/T207013 (10Banyek) oh.
[11:28:00] 10DBA, 10User-Banyek: /run/mysqld with correct rights should be exitst - https://phabricator.wikimedia.org/T207013 (10Banyek) 05Open>03Invalid See: T207013#4666042
[11:29:02] I'll go and eat something now
[11:29:09] 10DBA, 10User-Banyek: /run/mysqld with correct rights should be exitst - https://phabricator.wikimedia.org/T207013 (10jcrespo) See: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/427902/
[11:56:29] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Addshore) >>! In T206743#4665661, @jcrespo wrote: > @Addshore The main t...
[12:00:29] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10jcrespo) > I'll generate a list once all slaves that serve mw traffic ar...
[12:44:01] I'm gonna remove the manually created /run/mysqld directory, and restart db2096 to get rid of some future problems
[12:47:50] is anybody doing something on dbstore2002?
[12:48:05] I see there's replication lag there
[13:04:52] jynus: ok to repool db1092?
[13:05:07] banyek: anything going on on dbstore2002?
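A quick way to answer "anything going on on dbstore2002?" is to look at the lag on its replication threads; a small sketch, where the host list is only an example and `SHOW ALL SLAVES STATUS` is used because dbstore2002 replicates several sections:

```bash
# print lag for every replication connection on each host
for HOST in dbstore2002.codfw.wmnet db2096.codfw.wmnet; do
    echo "== $HOST =="
    mysql -h "$HOST" -e "SHOW ALL SLAVES STATUS\G" \
      | grep -E 'Connection_name|Seconds_Behind_Master'
done
```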
[13:05:15] marostegui: no, I need them depooled for a long time
[13:05:23] both db1087 and db1092
[13:05:31] I am going to stop replication on them
[13:05:40] It was showing replication lag, but it vanished
[13:07:10] jynus: roger, db1092 is all yours then
[13:19:50] line 645 and 689
[13:20:09] first when you said 'comment out' I commented out both of them
[13:20:24] nothere
[13:24:22] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) All the pooled replicas have now compressed tables, can you confirm from your end if this...
[13:28:32] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) So as Jaime said at T206743#4666169 all the pooled replicas...
[13:28:58] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui)
[13:29:18] jynus: should I downtime db1124?
[13:29:20] db1087 slave
[13:29:45] I guess
[13:29:49] ok
[13:29:51] giving it 24
[13:29:59] 24h
[13:30:16] why don't we start replicating from codfw?
[13:30:25] my work won't be fast
[13:30:36] and this is why we have redundancy
[13:30:43] I am fine with that :)
[13:30:47] I guess it may break?
[13:31:09] I don't know
[13:31:27] I downtimed it
[14:44:52] The replication slave lag is exploding in codfw
[14:45:00] banyek: check -operations
[15:22:00] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Addshore) I have prepared https://www.wikidata.org/wiki/User:Addshore/20...
[15:24:50] The battery finally arrived for db1092, please lmk when it's down so I can replace it. Thanks!
[15:31:01] cmjohnson1: you at the DC now?
[15:31:24] yes
[15:31:37] ok, let me check one thing to make sure I can power it off
[15:35:50] 10DBA, 10Operations, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10akosiaris) All this seems pretty correct to me and does explain what we've experienced pretty well
[15:41:50] cmjohnson1: db1092 is all yours!
[15:43:05] great..thx
[15:49:11] marostegui powering up
[15:49:16] thanks!
[15:52:43] cmjohnson1: I see the host now
[15:52:44] thanks!
[15:55:51] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) 05Open>03Resolved Battery replaced by Chris - thank you!: ``` Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ```
[16:16:17] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10TBolliger) >>! In T204006#4665621, @Marostegui wrote: > Now that I see it we are even more than 50% done, as we have also done lots of sectio...
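The T205514 comment above confirms the new battery from the RAID controller output; assuming an HP Smart Array controller and the hpssacli tool on db1092 (an assumption, not stated in the log), that check would look something like:

```bash
# confirm the controller sees the replaced battery and that the
# write-back cache is healthy again
hpssacli controller all show detail | grep -iE 'battery|cache status'
```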
[16:30:17] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10aezell) Yes, thanks @Marostegui for providing some context about how these changes normally happen. This is my first one and so I didn't know...
[18:00:06] 10DBA, 10Cloud-Services: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916 (10Banyek) p:05Triage>03Normal Please ping me (or anybody in the persistence team) when the wiki is created, and I can do the sanitization
[18:00:21] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916 (10Banyek)
[18:12:48] 10DBA, 10MediaWiki-Database, 10Operations, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The next step will be to stop replication of pc1004 from pc2004 and then run the following code in a screen in pc2004: ``` for TABLE in $(...
[20:01:18] 10DBA, 10MediaWiki-Database, 10Operations, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Imarlier)
[21:18:30] 10DBA, 10Cloud-Services: Prepare and check storage layer for vnwikimedia - https://phabricator.wikimedia.org/T207095 (10Urbanecm)
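The loop in the 18:12:48 comment is cut off in this log; a plausible reconstruction of an optimize run over the parsercache tables on pc2004 (a sketch of the general idea, not the exact script from the task) would be:

```bash
# rebuild every parsercache table (pc000..pc255) to reclaim disk space;
# for InnoDB, OPTIMIZE TABLE recreates the table and analyzes it
for TABLE in $(mysql -BN parsercache -e "SHOW TABLES LIKE 'pc%';"); do
    echo "$(date) optimizing ${TABLE}"
    mysql parsercache -e "OPTIMIZE TABLE ${TABLE};"
done
```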