[03:43:02] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Bawolff)
[05:05:12] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) 05Open>03Resolved All good - thank you ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name :Virtual Disk 0 RAID Level : Primary-1, Secondary-0, RAID...
[05:10:54] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) 05Open>03Resolved The RAID got rebuilt fine. The disk came with some errors, but let's ignore that and stop wasting disks; let's wait till it fails for real to replace it. ``` Number...
[05:12:19] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) Thanks for the update Chris - unbelievable!
[05:28:55] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05stalled>03Resolved Thanks Papaul! This looks good - we will take it from here! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target...
[05:29:09] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui)
[05:29:56] 10DBA: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Marostegui)
[05:30:59] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui)
[05:31:01] 10DBA: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Marostegui)
[05:31:08] 10DBA: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Marostegui) p:05Triage>03Normal
[05:34:23] 10DBA, 10MediaWiki-Special-pages, 10Datacenter-Switchover-2018: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages - https://phabricator.wikimedia.org/T206592 (10Marostegui) That could be the HW difference we have between codf...
[06:19:10] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10Marostegui) Great news so far!!
[07:55:47] o/ marostegui i have a general sql question for you :)
[07:56:06] can dbs use pre-existing indexes to help them build other indexes when you add them?
[07:56:16] I guess the answer is yes, but I don't want to guess :D
[07:56:55] jynus: ^^
[08:00:10] addshore: I am not sure I get what you mean
[08:00:28] so, say i already have a table with an index on field1 and field2
[08:00:34] and i want to remove the index and add one on just field1
[08:00:49] does it make sense to add the index first before removing the old one?
[08:01:21] i.e., will the index get added faster if i do it while the old index still exists? as mysql already has some indexey knowledge about the field?
[08:01:32] addshore: it should be done in the same transaction ideally so no queries would run without that index if the server isn't depooled
[08:03:13] but will it be faster if I add the new one first? or not?
[08:04:05] addshore: I don't think so, but I am not 100% sure about that. Maybe knows :)
[08:04:10] maybe jaime knows, I meant
[08:04:29] marostegui: thanks! :)
[08:04:43] * addshore will just sit here as normal and watch for any comments from jaime :D
[08:10:46] I think the index will still need to be recalculated, that is why I think it won't be much faster anyways
[08:17:57] "can dbs use pre existing indexes to help them build other indexes when you add them?" is ambiguous
[08:18:27] normally you want to alter the same table multiple times at the same time so the table is not scanned twice
[08:18:45] e.g. when you write create index; create index; drop index;
[08:18:56] we apply those in our own way in a single statement
[08:19:51] it is possible that executing a second alter sometimes could be faster if you test it, because pages will be in the buffer pool
[08:20:17] but there are too many "if"s there- the kind of alter, the version, the actual operations needed
[08:20:33] e.g. the latest versions of innodb allow to create new columns instantly
[08:20:57] some indexes are very fast to create, while other alters may take more time
[08:21:06] you shouldn't think about that
[08:21:31] you ask for a change to be done and it is our job (or manuel's job :-)) to apply it quickly
[08:22:00] for example, sometimes we ask if the change you are doing is going to be followed by another so we can do it at the same time
[08:22:45] > does it make sense to add the index first before removing the old one?
[08:23:06] create the drop index and the add index, we will get it right
[08:23:27] (which, as manuel says, is "at the same time")
[08:26:44] thanks!
[08:26:59] this wasn't for wmf production, but for the people using update.php in mediawiki :)
[08:36:04] well, update.php is its own way
[08:36:23] because for core at least, it is done in a strange way to be compatible with sqlite
[08:36:38] so not sure it is the best example
[08:36:50] (that is why we don't use it)
[08:45:30] apparently creating backups from 500GB hosts with 2 sections takes less time than from 128GB ones with 5 sections
[08:46:47] 500GB hosts what do you mean?
[08:47:43] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Backup+of&scroll=151
[08:47:52] I was checking if we had fresh backups
[08:48:31] we may have duplicate backups of tables on s3 and s5
[08:50:12] Ah yeah, true
[08:51:42] backups on eqiad finished at 03:06
[08:52:11] well, 6:11 if you count m2
[08:53:01] m2 is otrs, right?
[08:53:16] yes
[08:53:39] and still ongoing, s4 and s8, with m5 not started
[08:54:50] do you have a table rename script from the last time?
[08:55:33] and can we run it on db1118 for testing?
[08:55:37] I was checking for it yesterday
[08:56:39] https://phabricator.wikimedia.org/P7655
[08:56:46] it needs to be adapted for the new wikis
[08:56:52] as you can see, very complex script
[08:57:20] we need an unrename one
[08:57:41] yeah, I am looking for that one too
[08:57:55] and actually, it need set sql_log_bin=0?
[08:57:58] *needs
[08:58:02] yep it does
[08:58:14] maybe last time we renamed everything
[08:58:24] no, I think that is the one I used on my local tests
[08:58:33] That's why it doesn't have the logbin
[08:58:36] ok
[08:58:46] I am looking for the rename back one
[08:58:48] do you want me to put it on neodymium?
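(Editor's note: a minimal sketch of the "do both in one statement" approach described above, so the table is only rebuilt once instead of twice. The database, table and index names are hypothetical placeholders, not anything from the chat.)

```
# Combine the new index and the old index removal in a single ALTER so the
# table is only scanned/rebuilt once. `mywiki`, `mytable` and the index names
# are made up for illustration.
mysql mywiki -e "
  ALTER TABLE mytable
    ADD INDEX idx_field1 (field1),
    DROP INDEX idx_field1_field2;
"
```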
[08:58:54] yep
[08:58:59] ok, will leave you on your own
[08:59:16] ping me so we can test it on db1118 (enwiki)
[08:59:20] yeah
[08:59:38] I may go for a coffee now
[08:59:48] good, I will look for the revert script now
[09:01:13] haha the revert one is even more complex
[09:01:17] I just pasted it on the same paste
[09:02:24] arg, that is ugly
[09:02:30] let me fix it
[09:02:34] (but later)
[09:02:36] I know! :)
[09:02:46] But hey, it worked!
[09:03:03] in this house we obey the rules of thermodynamics!
[09:03:08] hahahahahahahahah
[09:03:11] I love that quote
[09:13:13] jynus: marostegui: do you see me?
[09:13:20] yes
[09:13:40] I dont *see* you
[09:13:58] :D OK, I meant see my message
[09:14:19] banyek: this looks like a task for you to start tomorrow? https://phabricator.wikimedia.org/T206593
[09:14:22] I thought you maybe had broken into my house
[09:14:36] then I would yell: 'BOO'
[09:14:59] 10DBA: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek) a:03Banyek
[09:15:09] marostegui: yes, I'll have questions probably
[09:15:11] marostegui: BTW, remember we also have a set of proxies
[09:15:29] banyek: feel free to ask, but we have many things to do today, so probably better to start tomorrow
[09:15:31] we maybe should migrate them even with no arch change
[09:15:41] jynus: you mean the new ones?
[09:15:49] can you ask somebody to invite me to #mediawiki_security?
[09:16:01] yes
[09:16:03] marostegui: ok
[09:16:06] banyek: now that you have the cloak, try to get robh to add you to the whitelist
[09:16:34] jynus: you mean an old -> new replacement even with the same architecture, right?
[09:17:56] 10DBA, 10User-Banyek: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 (10Banyek)
[09:18:17] I wanted to get T206593 done between tomorrow and friday, so we can know what to do with the m3 codfw master (regarding the BBU), as if it is failed, we can move an x1 host to m3, and do an m3 failover and decommission the old one
[09:18:18] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593
[09:18:48] banyek expressed interest in replacing the proxies, so let's get the x1-m3 sorted between today and early next week and then we can check back on the proxies I would say
[09:19:11] does that sound reasonable banyek? ^
[09:21:31] I guess so, yes. I am not sure now what that actually means - I mean task-wise. But probably you can point me in the right direction
[09:21:55] banyek: we can catch up on the plans once that host is on x1, which shouldn't take long
[09:22:16] meanwhile we need to test the BBU on db2042, which I guess you'll do next week with papaul?
[09:22:36] I haven't seen his answer yet
[09:23:02] try to ping him on friday so we can schedule it for next week
[09:23:44] ok, I will, b/c I didn't get an answer yet
[09:24:22] yeah he is probably busy, I would suggest to ping him tomorrow so we can organize ourselves
[09:51:56] MAYBE I'll get my laptop tomorrow, I just called UPS. They gave me a window 9am-7pm
[09:52:06] 10DBA, 10Goal: Monitor backup generation for failure or incorrect generation - https://phabricator.wikimedia.org/T198447 (10jcrespo) {F26475371} {F26475370} {F26475369} {F26475368}
[09:52:20] I could bet that they will bring it in the exact moment when I'm at the kindergarten
[09:52:58] probably
[09:59:10] will we do an actual google meet now?
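(Editor's note: the rename and un-rename scripts discussed above live in P7655 and are not reproduced in this log. The sketch below only shows the general shape under stated assumptions: it loops over the tables of the wikis being moved and renames them aside with sql_log_bin=0 so the renames stay out of the binlog, as agreed in the chat. The prefix, and whether tables are renamed in place or elsewhere, are hypothetical; the host and wiki list are the ones mentioned in the conversation.)

```
#!/bin/bash
# Rough sketch only -- the real script is in P7655 and is more complex.
# Rename every table of the moved wikis so stray writes fail loudly;
# SET SESSION sql_log_bin=0 keeps the renames out of the binlog.
HOST="db1075.eqiad.wmnet"   # s3 eqiad master in this migration (per the chat)
PREFIX="moved_"             # hypothetical prefix, not necessarily what P7655 used
for wiki in cebwiki shwiki srwiki mgwiktionary enwikivoyage; do
    for table in $(mysql -h "$HOST" -BN -e "SHOW TABLES" "$wiki"); do
        mysql -h "$HOST" -e "SET SESSION sql_log_bin=0;
            RENAME TABLE \`$wiki\`.\`$table\` TO \`$wiki\`.\`${PREFIX}${table}\`;"
    done
done
```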
[09:59:29] nop
[09:59:54] it is part of the pending steps for the wikis movement, check our etherpad starting on line 19
[10:01:48] I forgot to add one step- update labsdb proxy
[10:03:08] where is that?
[10:03:21] I added it now
[10:08:02] aha I see
[10:20:03] so do we deploy?
[10:20:26] yep
[10:20:28] ready when you are
[10:20:47] let's go in order, downtimes first
[10:21:22] * banyek takes the observer position
[10:22:03] we need to downtime read only of all changed masters (s1-s8, x1, es2, es3) until 16h
[10:22:06] UTC
[10:22:19] sounds correct
[10:22:31] do you want me to merge the change meanwhile?
[10:22:48] because you will do the downtimes you mean?
[10:23:02] I do x1, es2, es3 downtiming
[10:23:02] I can do either way, either mediawiki or the downtime
[10:23:05] Ah
[10:23:08] Cool thanks banyek
[10:23:42] let's organize
[10:23:57] banyek took a task, which one do you want, marostegui?
[10:24:06] I can do the next step, the mediawiki deployment
[10:24:19] ok, doing the read only of s*
[10:24:26] cool
[10:24:37] I will get the change merged but NOT deployed
[10:27:20] I am thinking we should do lines 21 and 22 before 20, just in case something uses dblists?
[10:27:25] db2034 mariadb read-only downtimed
[10:27:29] (x1)
[10:28:53] db2016 mariadb read-only downtimed (es2)
[10:29:00] mediawiki change merged but NOT deployed, waiting for you guys
[10:30:06] db2017 mariadb read-only downtimed (es3)
[10:30:22] I can go for s1..s8 to
[10:30:27] *too
[10:30:34] Jaime was going to do those, let's wait for him to report back
[10:30:40] ok
[10:32:05] let's count the downtimes, there should be 22 only
[10:33:23] for starters, I see db1069 read only not downtimed
[10:33:42] I did the other side only, fixing
[10:35:18] db1069 mariadb read-only downtimed (x1)
[10:36:11] es1015 mariadb read-only downtimed (es2)
[10:37:06] es1017 mariadb read-only downtimed (es3)
[10:37:17] ok done
[10:39:36] I can see those
[10:39:50] I can also see the s1-s8
[10:41:08] so, done?
[10:41:15] yep
[10:41:20] if done let's mark it as done on the etherpad
[10:41:37] do we want to do lines 21 and 22 before 20?
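(Editor's note: the "mariadb read-only" downtimes above refer to the Icinga checks watching each master's read_only flag, which will flip during the switchover window. A quick manual way to see what those checks look at is sketched below; the host is one of the masters named in the chat, and this is not the actual Icinga command.)

```
# Manual sanity check of the flag behind the downtimed "mariadb read-only"
# alerts; db2034 is the x1 codfw master mentioned above.
mysql -h db2034.codfw.wmnet -e "SELECT @@global.read_only;"
```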
[10:41:50] yes
[10:41:59] I changed before
[10:42:12] because it only affects eqiad
[10:42:20] or it will fail and we will need to revert
[10:42:31] masters on eqiad are in read only (hard mode)
[10:42:46] so, s5 codfw has the filter to ignore the wikis, which is correct, so that is done
[10:42:48] while the other way round we allow for errors, in theory
[10:43:12] sorry
[10:43:23] I meant we should do them in the order it is
[10:43:27] not in another
[10:43:30] yeah
[10:43:41] I mean: 21, 22, 20
[10:43:52] actually line 22 is not needed anymore
[10:44:08] it is
[10:44:17] s3 changes will happen
[10:44:23] from codfw
[10:44:29] if we rename the tables on s3 eqiad, anything that writes there will fail
[10:44:40] and we don't want replication to fail :-)
[10:44:58] I am getting lost
[10:45:05] I am not, so ask
[10:45:20] so, we are going to rename tables on the s3 eqiad master
[10:45:30] aaaaaaah
[10:45:31] right
[10:45:33] got it
[10:45:34] :)
[10:45:37] yep
[10:45:39] maybe in a slightly different order
[10:45:43] filter to ignore them
[10:45:46] we add the filter, then we rename
[10:45:49] yep
[10:46:09] see the order now
[10:46:23] makes sense
[10:46:29] let me deploy the dblists then
[10:49:12] there are some mwmaint2001 tasks ongoing
[10:49:35] deployment done
[10:49:53] but I believe those will not break; just the list may have confusing actions (like updating s3 wikis and really updating s3 and some s5 wikis)
[10:50:04] as the real config was already changed
[10:50:20] I am going to open a task to releng later: https://phabricator.wikimedia.org/P7656
[10:51:28] shall I deploy the filter to db1075?
[10:51:49] do we need to downtime replication?
[10:51:57] nah, should be quick
[10:52:02] or can we just do a quick CHANGE?
[10:52:11] prepare it here, then, and we can see it
[10:52:16] yeah, I was doing so
[10:52:34] note down however the binlog pos
[10:52:52] of the real codfw master and the local one just in case
[10:53:09] show master status; show slave status\G stop slave; SET GLOBAL Replicate_Wild_ignore_Table = 'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%'; start slave;
[10:53:18] the other way around
[10:53:29] yep, i just realised
[10:53:29] first stop, then shows; etc
[10:53:46] stop slave; show master status; show slave status\G SET GLOBAL Replicate_Wild_ignore_Table = 'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%'; start slave;
[10:54:19] looks good, log and paste the coords on the etherpad when done
[10:54:26] yep
[10:54:27] doing it
[10:54:48] done
[10:55:10] labsdb and dbstore1002 should be ok, right?
[10:55:36] yeah, they will get it thru s5
[10:55:48] no issues so far at mw
[10:56:06] we should test mwdebug1 pages after the rename
[10:57:01] I am checking for labs
[10:57:29] nothing broken on 1009
[10:58:40] ok, done for me, do you have the line for the rename?
[10:58:59] yes
[10:59:21] I mean I have the paste
[10:59:40] yeah, but that was wikidata
[10:59:46] and without log bin
[10:59:46] i know
[10:59:52] maybe you updated that?
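(Editor's note: the corrected one-liner quoted above, spelled out as the agreed procedure: stop replication, record coordinates for the etherpad, set the wild-ignore filter, restart. The host is db1075 (the s3 eqiad master named in the chat) and the wiki list is the one quoted; everything else follows directly from the conversation.)

```
# Apply the replication filter for the wikis moving to s5, recording the
# binlog positions first as requested above.
mysql -h db1075.eqiad.wmnet <<'SQL'
STOP SLAVE;
SHOW MASTER STATUS;
SHOW SLAVE STATUS\G
SET GLOBAL Replicate_Wild_Ignore_Table = 'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%';
START SLAVE;
SQL
```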
[10:59:59] No, I will do it in a sec
[11:00:01] or just show me on neodymium :-)
[11:00:11] whatever you prefer
[11:00:13] (I thought you were going to do so) - but I can do it
[11:00:16] give me a sec
[11:00:23] I was going to change the other one
[11:00:25] cool
[11:00:28] I will change the rename
[11:00:30] the rename back
[11:00:35] I will show you on neodymium, one sec
[11:00:38] I can do the other, too
[11:00:44] you do the revert
[11:00:46] And I do the other
[11:00:48] sure
[11:05:51] So I did a quick sed: /home/marostegui/rename_tables.sh
[11:08:02] I was doing one in parallel because I got bored :-)
[11:08:09] :p
[11:08:15] up to you whichever you want to run
[11:08:42] banyek: do you have x-wikimedia-debug installed?
[11:08:53] no
[11:08:57] what's that?
[11:08:58] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[11:09:23] Good to have it, we will be using it today probably
[11:09:50] nice
[11:09:58] banyek: install that and try to get cebwiki to break on mwdebug1001
[11:10:47] marostegui: let me introduce you to the concept of loops :-P
[11:11:11] jynus: a sed+echo was faster!
[11:11:58] cebwiki?
[11:12:20] banyek: one of the wikis we are moving
[11:12:32] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (10Urbanecm)
[11:12:34] 10DBA, 10Cloud-Services, 10User-Banyek: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (10Urbanecm)
[11:13:37] banyek: just in case
[11:13:45] I don't want you to break it
[11:13:45] like ceb.wikipedia.org I mean
[11:13:59] I want to detect errors, if that was not clear
[11:14:04] yes, but using x-wikimedia-debug to point to mwdebug1001
[11:14:08] *you
[11:14:25] so you can browse the site from eqiad
[11:14:40] marostegui: I edited your script, see it
[11:14:52] so ugly now
[11:14:58] :p
[11:15:44] more refactoring of variable names
[11:15:46] see it now
[11:15:57] haha
[11:16:28] i is for counters only
[11:17:00] so we are good then?
[11:17:05] ` is non-standard
[11:17:11] let me change that, too
[11:18:52] looks good to me, I copied it to my home and did a dry run
[11:19:02] great
[11:19:53] drops will be difficult because of multisource
[11:20:03] yeah, last time we did it host by host
[11:20:03] either one by one
[11:20:07] or with filters
[11:20:10] both non-ideal
[11:20:16] i prefer host by host
[11:20:21] but we can discuss that when the time arrives
[11:21:12] I am getting some cebwiki errors
[11:21:51] about?
[11:22:30] most from mwdebug1001
[11:22:55] banyek: you browsing the site?
[11:23:00] yes
[11:23:11] great
[11:23:16] I am too
[11:23:16] but now I closed the window
[11:23:28] no, keep doing it
[11:23:31] well, but he is doing what active traffic will be doing
[11:23:42] ok
[11:23:48] which errors are you seeing jynus?
[11:23:54] actually I am wondering which country this is
[11:23:54] I am browsing too btw
[11:24:21] banyek: you know wikipedias are not per country but by language, right?
[11:25:06] I meant that, sorry
[11:25:08] swat is happening, too
[11:25:30] banyek: ceb is for https://en.wikipedia.org/wiki/Cebuano_language
[11:26:07] jynus: browsing I am not facing any errors
[11:26:12] ó
[11:26:19] , I mean oh!
[11:26:27] marostegui: yeah, but they appeared on the logs
[11:26:49] jynus: can you share a link?
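(Editor's note: besides the browser extension linked above, the X-Wikimedia-Debug header can also be sent from the command line to route a request to mwdebug1001 in eqiad. The header value format below is an approximation and should be double-checked against the wikitech page; the URL is one of the moved wikis mentioned in the chat.)

```
# Hit cebwiki through mwdebug1001 via the X-Wikimedia-Debug header and print
# only the HTTP status code; header value format per the wikitech page
# (verify before relying on it).
curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' \
    'https://ceb.wikipedia.org/wiki/Special:RecentChanges'
```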
[11:27:10] https://logstash.wikimedia.org/goto/cb77554f18188695b935fd353bbff78e
[11:27:53] SELECT ts FROM heartbeat.heartbeat WHERE shard = 's5' AND datacenter = 'codfw' ORDER BY ts DESC LIMIT 1
[11:27:57] heh
[11:28:22] yeah, but probably because they try to connect to the cebwiki database?
[11:29:18] Wikimedia\Rdbms\LoadMonitor::getServerStates: got lag times (global:lag-times:1:db1070:0-1-2-3-4-5-6) from local cache
[11:29:31] yeah, that is strange
[11:29:45] I think I know what it is
[11:29:54] it tries to know the replication
[11:29:56] lag
[11:30:05] but because actual master != real master
[11:30:09] it gets confused
[11:30:34] no s5 heartbeat
[11:30:40] the other way round
[11:30:42] no s3 heartbeat
[11:30:53] and because they are on a different shard, it has issues
[11:31:21] we should prepare a deploy of moving wikis to s5 on codfw
[11:31:25] even if they are not there
[11:31:29] so they are consistent
[11:31:38] this will not happen after deploy
[11:31:43] yeah, exactly
[11:31:48] because replicas and masters will be on the same section
[11:31:57] lag checking now is confusing
[11:32:05] But mwdebug should only be reading db-eqiad
[11:32:11] (I assume)
[11:32:23] yeah, but it tries to see the lag
[11:32:25] and fails
[11:32:56] because it asks the real master and it is on a different section
[11:33:13] yep, after the failover db1070 will be the real master indeed
[11:33:14] it will not happen once everything is on s5
[11:33:24] that is my evaluation
[11:34:10] So the change to db-codfw.php isn't needed to stop the error
[11:34:14] and this is why doing it cross-dc is safer
[11:34:27] because jobqueue and maintenance will be stopped
[11:34:35] preventing those spurious errors
[11:34:51] yeah, in fact, browsing works fine
[11:35:02] also gtid is horrible for this
[11:35:10] which affects the issue
[11:35:15] although heartbeat doesn't fix it
[11:35:25] because we decided not to replicate it
[11:35:39] let's do the rename?
[11:35:56] yeah
[11:35:57] let's do it
[11:36:07] banyek: check dewiki on eqiad
[11:36:14] with mwdebug1001
[11:36:20] to see if there is some issue there, too
[11:36:26] (it shouldn't be)
[11:37:55] seems good after clicking a few
[11:37:57] no errors
[11:39:10] jynus: log the rename when you are done, so we can pinpoint things
[11:39:20] oh, I was waiting on you
[11:39:20] still seems good
[11:39:26] I can run it!
[11:39:28] jynus: hahahaha
[11:39:31] I thought you were!
[11:39:41] I just did a dry run
[11:39:49] but didn't want to touch it because it was in your home
[11:39:54] haha
[11:39:56] do I?
[11:40:00] Sure, whatever you prefer
[11:40:02] ok, from now on
[11:40:06] go!
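(Editor's note: the SELECT quoted above is the pt-heartbeat row the lag check was reading; lag is roughly "now" minus the newest ts written by the master of that shard in that datacenter, and the s5/s3 shard mismatch described above is what made the check misbehave. The sketch below shows an approximate manual version of that check; the host is db1070 (the s5 eqiad master referenced in the log line), and the exact arithmetic is an approximation, not MediaWiki's code.)

```
# Approximate manual lag check against the heartbeat table: newest timestamp
# from the s5 codfw master compared with the current UTC time on the replica.
mysql -h db1070.eqiad.wmnet -e "
  SELECT ts, TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS approx_lag_seconds
  FROM heartbeat.heartbeat
  WHERE shard = 's5' AND datacenter = 'codfw'
  ORDER BY ts DESC LIMIT 1;
"
```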
[11:40:17] before any maintenance, we declare a leader
[11:40:26] we didn't this time :)
[11:40:29] and he does everything unless stated differently
[11:40:41] I am running the rename script
[11:40:51] excellent
[11:42:13] it is ongoing
[11:42:19] great
[11:42:26] check for errors, repl, requests
[11:43:21] also, we should recheck mwdebug1* loading correctly still
[11:43:29] done, it finished
[11:43:40] checking shwiki on mwdebug1001 for example
[11:44:35] checking enwikivoyage, probably the one with most traffic
[11:45:36] hehe by random clicking I ended up at https://sh.wikipedia.org/wiki/Licenca_GNU-a_za_slobodnu_dokumentaciju
[11:46:16] rcs is the most visited one that is uncached
[11:49:16] shwiki seems ok
[11:50:11] I have marked that line as DONE
[11:51:27] we can keep monitoring, but I don't see anything alarming
[11:51:45] we knew some errors would be seen due to the asynchronicity of dblists
[11:52:04] that is why we delayed this until hours before the switch
[11:52:07] yeah, I think we are in good shape
[11:52:14] but nothing is "broken"
[11:52:36] and we cannot fix the lag checks until they are ready for multi-dc
[11:52:40] so we need to run /usr/local/sbin/wikireplica_dns --aliases --shard s5 ?
[11:52:45] and then --shard s3?
[11:52:50] yes
[11:52:56] it takes a while
[11:53:01] I would tell you to do that
[11:53:08] because you may not have done that ever
[11:53:08] Yeah, I am going to do it
[11:53:31] I discovered the dns for each wiki are based on the dblists
[11:53:44] so it needs to run if those are changed
[11:53:44] yeah, I had no idea about that
[11:53:54] not sure if they do something else when one is added
[11:54:19] but I did those when doing maintenance on the proxies
[11:54:56] Running it for s5
[11:55:07] do it in a screen
[11:55:09] db2072 has 26k queries
[11:55:11] it takes a while
[11:55:14] and db2092
[11:55:25] (That is enwiki)
[11:55:31] just saying, saw it on tendril
[11:55:32] checking that
[11:55:56] thanks
[11:57:28] DisambiguatorHooks I see often now, which I don't normally see
[11:58:19] and some Title::getFirstRevision
[11:58:23] but those are normally common
[11:58:25] That's always there
[11:58:25] yeah
[11:58:31] and it is still fast
[11:58:44] there are issues with running the script ha
[11:58:56] connection backend is growing
[11:59:10] backlog I meant
[11:59:16] https://phabricator.wikimedia.org/P7657 -> Will try once more and then ping cloud teams
[12:00:03] something happened at around 11:17
[12:00:06] there is an increase there
[12:00:24] qps tripled at that time on enwiki
[12:00:47] rows written started there, too
[12:00:52] so maybe an editing bot?
[12:01:05] backscrolling on operations to see if there is something there
[12:01:08] traffic doubled
[12:01:27] ~/logmsgbot 13:26> !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465602|Permissions changes on itwikibooks (T206447)]] (duration: 00m 57s)
[12:01:27] T206447: changes to manage user group "confirmed" and "accountcreator" on it.wikibooks - https://phabricator.wikimedia.org/T206447
[12:01:31] doesn't look related
[12:02:35] ^banyek if you ever asked yourself why db hosts are so idle, this is why- sometimes QPS triples in 1 minute
[12:04:28] yep
[13:51:26] marostegui: do you take the binlog as you proposed?
[13:51:35] yep!
[13:51:40] I will paste it on our etherpad?
[13:51:52] ok
[13:52:09] the important one I guess is the s5 master on eqiad?
[13:52:20] Yeah
[13:52:48] put it on line 43
[13:57:38] there we go for the failover!
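(Editor's note: a sketch of the wikireplica DNS/alias update step discussed above, run for both affected sections inside a screen session since it takes a while. The flag names are the ones quoted in the chat; check the script's own help output before relying on them, since nothing beyond those flags is confirmed here.)

```
# Open a screen session first (the run is long), then run the alias update
# for both sections affected by the wiki move.
screen -S wikireplica-dns
/usr/local/sbin/wikireplica_dns --aliases --shard s5
/usr/local/sbin/wikireplica_dns --aliases --shard s3
```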
[14:00:36] https://media.giphy.com/media/cJoJXuKkiQthQjymBi/giphy.gif
[14:03:40] can someone prepare a patch to depool db1092 just in case?
[14:03:50] the host with the broken BBU
[14:04:55] I'll do it
[14:05:00] thanks
[14:05:19] also the command to force write-back on the controller would be nice to have handy
[14:05:31] to mitigate any possible lag until we have depooled it
[14:05:58] and add it here https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli
[14:06:00] I'll put that as a comment on the commit message of the depooling so we have them in one place
[14:06:08] or that :D
[14:06:12] ok
[14:06:18] coolio
[14:06:21] thank you
[14:06:27] actually we could add it to the various pages for platform-specific hosts, but yeah, that for now
[14:06:30] :)
[14:08:04] marostegui: it is already documented on mariadb/troubleshooting
[14:08:21] jynus: sure, I didn't check, I just wanted to have it handy in case it is needed
[14:08:32] do we need a replacement host for s8 api instead of db1092 in case we have to depool it?
[14:09:25] jynus: it is only for megacli, not for HP hosts, or I cannot see it there
[14:09:36] anyways, not important now
[14:10:08] banyek: I would say just depool it completely and leave db1104 alone
[14:10:14] kk
[14:10:52] maybe give 50 more weight to both RC slaves and remove 100 from db1104 as main traffic
[14:10:59] so db1104 => 100 + api traffic
[14:11:04] and db1099 and db1101 with 150
[14:15:52] ok
[14:17:47] banyek: can you monitor lag on db1092?
[14:17:54] yes
[14:17:59] ta
[14:19:27] let's see if we get errors for the moved wikis
[14:28:13] we have some hosts doing 25k qps, but all good
[14:35:10] and db1092 replication still holds
[14:35:31] awesome
[14:38:48] isn't it a bit weird that we have 0 errors for the DBs?
[14:41:13] there is a very low edit rate
[14:41:33] I can see enwikivoyage.revision increasing, so that is good for the new wikis
[14:43:15] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) This was done successfully and new wikis are now live on eqiad. What is pending now is: - Run the DNS changes for wikireplicas: T206623: - Re-...
[14:43:39] jynus: yeah, still half the edit rate of what we had before the failover
[14:43:47] marostegui: those moments are the worst, when after an operation everything looks too good
[15:15:25] marostegui: a little bit off-topic, but last Monday you mentioned some food for breakfast whose name began with T
[15:34:36] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Cmjohnson) Reseated the disk....let's see what happens
[15:40:09] banyek|away: tostadas
[15:40:43] thanks, Jaime remembered. I didn't see you leave, sorry about it
[15:41:40] banyek|away: yes, I started at 7AM
[22:11:14] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 3 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) >>! In T205865#4655143, @Addshore wrote: > Reversing this experiment now that we have switched b...
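(Editor's note: the forced write-back command marostegui wanted to have handy for db1092 is not quoted in the chat; below is a sketch for a MegaRAID controller. The exact binary name and property spelling vary between MegaCli versions, so check https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli before running; as noted above, this does not apply to the HP hosts.)

```
# Force write-back caching despite the failed BBU, to mitigate lag until the
# host is depooled; spelling may differ per MegaCli version.
megacli -LDSetProp ForcedWB -Immediate -LAll -aAll
# Verify the resulting cache policy
megacli -LDGetProp -Cache -LAll -aAll
```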