[05:53:10] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3212977 (10Marostegui) db1071 and db1063 are done: ``` root@neodymium:~# mysql --skip-ssl -hdb1071 wikidatawiki -... [05:54:26] 07Blocked-on-schema-change, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3212978 (10Marostegui) db1071 and db1063 are done: ``` root@neodymium:~# mysql --skip-ssl -hdb1071 wikidata... [05:57:16] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3212979 (10Marostegui) labsdb1009 and labsdb1010 are done: ``` mysql:root@localhost [enwiki]> select @@hostname; +------------+ | @@hostname | +------------+ | labsdb10... [06:05:41] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3212981 (10Marostegui) I have double checked all the shards now and the wikis that contain that table, they all hav... [06:09:50] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3212983 (10Marostegui) Everything looks fine after this change. The writes have continued fine on the production hosts and dbstore1002, sanitarium and lab... [06:42:49] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3212996 (10Marostegui) s3 is done. The following wikis already had the correct structure. ``` arbcom_cswiki dtywiki ecwikimedia fiwikivoyage olowiki pawi... [07:09:18] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213030 (10Marostegui) s2 is now done. [07:24:02] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213035 (10Marostegui) s6 is done (db1022 is obviously broken so it never got the change: T163778, so ignoring it). [07:30:27] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213038 (10Marostegui) s4 is done. [07:39:39] db1040 free space is getting very low [07:39:51] should we just throw it away? [07:40:08] how much time a mysqldump? [07:40:53] A mysqldump maybe a day, a scp to dbstore1001 or so maybe a couple of hours [07:41:01] (i guess) [07:41:06] we can do that [07:41:23] if there is space on dbstore1001 [07:41:29] yes, there is with no problem [07:41:40] we should compress it though too [07:42:05] I will do it [07:42:09] great! 
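A minimal sketch of the two options weighed just above for getting data off db1040 before it fills up (logical dump vs. copying the data directory, compressed either way). Paths, the use of pigz, and streaming over ssh are assumptions for illustration, not the procedure that was actually run:

```bash
# Illustrative only: stream a compressed logical dump from db1040 to dbstore1001.
# Credentials, options and the target path are hypothetical.
mysqldump --single-transaction --all-databases -h db1040 \
  | pigz -c \
  | ssh dbstore1001 'cat > /srv/backups/db1040-$(date +%F).sql.gz'

# Alternative discussed above: copy the data directory instead (much faster than a
# logical dump, but mysqld on db1040 has to be stopped or otherwise quiesced first).
tar -C /srv -c sqldata | pigz -c \
  | ssh dbstore1001 'cat > /srv/backups/db1040-sqldata.tar.gz'
```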
thanks [07:42:10] and I will send it to es2001 [07:42:15] ah good [07:42:23] btw, review this: https://gerrit.wikimedia.org/r/#/c/350372/ [07:42:27] no rush [07:42:29] there is time [07:47:49] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213046 (10Marostegui) s7 is now done for the wikis with that table (centralauth doesn't have it) [07:49:17] Error 'Table 'bawiktionary.tag_summary' doesn't exist' on query. [07:49:26] yes, I am fixing those [07:50:13] we can put the db as IDEMPOTENT temporarilly [07:50:19] on db1069 [07:50:31] I rather not, so I have the errors under control :) [07:50:36] ok [08:09:03] most of ibdata seems to be empty- 300MB/s on one side, 0-9MB/s on the other [08:10:14] ah, that one was not using file per table [08:10:16] right right [08:11:40] I am also going to delete db1022 replication [08:11:52] no need for alerts if there is no data there [08:12:09] sure [08:12:20] we can even create the ticket to decomission it too no? [08:12:30] there is one already [08:12:40] ah :) [08:13:45] can I do s5 master change now? [08:14:29] db1026 is still doing the alter (only 10G left for the temporary table to reach the size of the real table) [08:14:33] but it is a non blocking change [08:14:36] so it is not delayed [08:14:57] so you can go ahead if you like [08:15:10] or wait like 30 minutes or so until the alter is done [08:15:15] ok [08:20:00] I think dbstore1001 failing doesn't have to do with load [08:20:07] but with contention on show slave status [08:20:26] but normally that has to do with load, no? :) [08:20:31] chicken egg issue [08:20:39] not necesarilly [08:20:49] there is a lot of nagios threads queued [08:21:08] they may be waiting to run show slave status [08:21:34] which we run 9*3 times ever 5 minutes [08:21:45] plus once per minute [08:21:59] yeah, but I have seen that in the past with super load (or stuck servers), where it takes ages to finish, and then others come, and it is a snowball and you never know what is the cause or consecuence [08:22:01] and then there is the backups, that may use that [08:22:08] remember when db1069 got stuck due to tokudb [08:22:13] show slave status would take ages to run [08:31:56] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213093 (10Marostegui) I have deployed the change on silver (labswiki) and on labtestweb2001 (labtestwiki) [08:32:40] I am almost sure it is the dumps [08:32:56] dump processlist: status: "killing slave" [08:33:31] which blocks nagios [08:33:37] which makes the fail check [08:33:52] so we cannot get less threads because it takes quite long, but if we don't then it gets overloaded :( [08:33:56] *the check fail [08:34:00] it doesn't [08:34:08] it is the backup process, not the load [08:34:18] we probably can have even more threads [08:34:31] as long as we do not mess with blocking operations like show slave status [08:34:44] maybe we should migrate to performance_schema [08:35:05] mariadb doesnt suppot performance_schema slave monitoring, I think [08:35:13] * marostegui cries [08:35:14] only mysql muti-source replication [08:35:25] I did that for processlist [08:35:31] but let's be honest [08:35:41] well we can increase the check time [08:35:45] for this host [08:35:50] no [08:35:55] we fix the backup system [08:36:00] well yes [08:36:01] which is planned anyway [08:36:09] but that is take a 
bit long i assume? [08:36:15] no [08:37:05] 5 days to write it [08:37:08] 30 days to test it [08:38:18] ah :) [08:40:41] we can also change the nagios check so it doesn't poll show slave status [08:41:18] it is almost done, but it would get false positives on hosts like db1022, where RESET SLAVE has been run [08:41:39] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3213100 (10Marostegui) I have downtimed these hosts for 24 hours: ``` db1094 db1093 db1092 db1091 es1019 ``` [08:42:18] do you need help powering off? [08:42:21] no [08:42:23] no worries :) [08:54:03] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213109 (10Marostegui) s1 is done. Pending s5 which I will wait until the switchover is done: T162133 Meanwhile I will double check all the done shards t... [09:21:16] I am going to deploy the query killer to db2055 [09:21:22] jynus: you can go ahead with s5, db1026 is done [09:21:26] ah! [09:21:27] nice [09:21:34] then I will do that first [09:21:40] great! [09:23:05] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3213155 (10Marostegui) db1026 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb1026 wikidatawiki -e "show crea... [09:24:15] I am going to shutdown the nodes for the network maintenance but I will not shutdown the s5 one :) [09:24:44] ok, so I am going to change the master from db1049 to db1063 [09:25:05] yes, that is correct [09:25:28] which means we will have 4 masters on D1 [09:25:29] D1: Initial commit - https://phabricator.wikimedia.org/D1 [09:25:52] oh... [09:28:07] s7,s6 s5 and s4 yes [09:28:14] is D1 going to be down? [09:28:39] No [09:28:40] https://phabricator.wikimedia.org/T162681 [09:28:48] some hosts are going to be moved (d2) [09:29:07] and ALL the servers in row d need to be recabled, so they will have small connectivity loss [09:29:53] let's do this fast [09:30:08] and we may want to move away some servers physically [09:30:34] yeah, we can talk to chris to see if he's got time tomorrow or friday [09:33:21] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3213195 (10Marostegui) The following hosts are down and ready to be moved anytime (@ayounsi): es1019 db1094 db1093 db1091 Pending db1092 which is going to be involved in a master switchover(T16213... [09:33:47] I can move db1092 now [09:34:29] no worries, we have time until 14:00 UTC :) [09:34:32] take your time [10:19:29] https://www.reddit.com/r/mariadb/comments/62fft0/i_am_a_software_engineer_for_the_mariadb/dfm9gy8/ [10:21:11] oh wow [10:21:30] you made my ticket famous :p [10:22:11] I am sorry, but it only has 1 up-boat [10:22:32] interesting that he doesn't think it is a serious problem [10:22:49] http://monty-says.blogspot.com.es/2017/04/mariadb-103-alpha-released.html [10:22:57] focus on compatibility [10:23:02] with Oracle [10:23:10] the database, not Oracle MySQL [10:24:38] that is quite interesting, I would hope that the second step is to make it with MySQL and I would have hoped that would have come before making it compatible with Oracle [10:24:49] because if they don't, they are clearly saying: we don't care about mysql [10:35:43] repool of db1017 [10:36:04] db1017? [10:36:16] has it been deployed? 
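One common way to avoid the SHOW SLAVE STATUS contention described above (around 08:20–08:41) is to derive lag from a heartbeat row that the master refreshes continuously, so the monitoring check only runs a plain SELECT and never touches the replication machinery. This is only an illustration of the idea, not the check that was deployed; the pt-heartbeat-style schema and threshold below are hypothetical:

```bash
# Hypothetical lag check that never polls SHOW SLAVE STATUS.
# Assumes a heartbeat.heartbeat table whose ts column is updated on the master
# every second, in UTC; all names and thresholds are made up for illustration.
LAG=$(mysql --skip-ssl -h dbstore1001 -BN -e \
  "SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) FROM heartbeat.heartbeat")

# Treat a missing/NULL heartbeat as "very lagged" rather than erroring out.
if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then LAG=999999; fi

if [ "$LAG" -gt 300 ]; then
    echo "CRITICAL: replication lag ${LAG}s"; exit 2
else
    echo "OK: replication lag ${LAG}s"; exit 0
fi
```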
[10:36:29] I am lost sorry [10:36:29] oh, yes [10:36:35] forget it [10:36:37] it is not you [10:36:40] xddd [10:41:18] I think you are ok to proceed with schema changes on s5, if you want [10:41:53] thanks! [10:42:16] also we can shutdown that pending host [10:42:28] ok, will do that in a bit [10:42:34] so it can replicate dewiki at least [10:43:14] at your will [10:43:56] 10DBA, 13Patch-For-Review, 05codfw-rollout: Replace some masters in eqiad while it is not active - https://phabricator.wikimedia.org/T162133#3213333 (10jcrespo) [10:44:25] ^I will close this and create a ticket with a check list before failback [10:44:35] cool! [10:45:08] 10DBA, 06Operations: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3213336 (10jcrespo) [10:45:10] 10DBA, 07Epic, 13Patch-For-Review, 05codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#3213337 (10jcrespo) [10:45:12] 10DBA, 13Patch-For-Review, 05codfw-rollout: Replace some masters in eqiad while it is not active - https://phabricator.wikimedia.org/T162133#3153482 (10jcrespo) 05Open>03Resolved [10:49:45] 10DBA, 13Patch-For-Review, 05codfw-rollout: Replace some masters in eqiad while it is not active - https://phabricator.wikimedia.org/T162133#3213352 (10Marostegui) Great job!! [11:03:55] I am going to finally fix dbstore1001 if I can [11:04:13] and deploy the query killer afterwards [11:04:19] oh good [11:05:40] now that I think, also the delayed slave messes with replication thread [11:05:59] which means backup + events + nagios = receipe for blockage [11:06:37] yeah, it is a nice combination of things [11:07:04] load is high, but it is not causing genral os unresponsivness [11:12:06] also, and this is interesting: it makes replication to restart autonatically even if events are disabled [11:12:20] probably because they where queued/related to dumps process [11:12:34] oh, so that is what makes replication get started again? [11:12:44] remember some months ago we didn't know what it was? [11:12:47] yes [11:13:06] there may be dozens of them blocked [11:13:20] and they restart even if you say event_scheduler=0 [11:13:37] is that a bug or a feature? :) [11:13:40] or the dump process does it [11:13:49] becaue it does stop slave on start of the dump [11:13:59] and probably start at the end of the dumps [11:14:12] which actually breaks it [11:14:14] ah, so it is not the events then [11:14:20] one of the 2 [11:15:45] so I will wait for dumps to finish to reposition dbstore1001:s7 [11:15:56] sounds sane yes [11:16:06] and that is how you (and I mean I) break dbstore1001 [11:16:37] I reposition, but it starts, then data breaks [11:16:57] i have nightmares if we have to reclone dbstore1001 XD [11:17:22] well, it has solutions [11:17:28] on 5.6 and 10.2 [11:17:40] but the issue is the version of the other hosts [11:38:10] 10DBA, 13Patch-For-Review, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3213441 (10jcrespo) It works: ``` root@db2055.codfw.wmnet[ops]> SELECT * FROM event_log WHERE stamp > '2017-04-26 11:3... 
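The query killer deployed to db2055 (mentioned at 09:21 and verified in the T160984 comment just above) logs what it kills to an event_log table. The real implementation is in the linked change; purely as a sketch of the mechanism — an event that walks the processlist, records offenders, and kills long-running user queries — with the threshold, user filter and event_log columns invented for illustration:

```bash
# Sketch only: NOT the killer actually deployed to db2055.
mysql --skip-ssl -h db2055 ops <<'SQL'
DELIMITER //
CREATE PROCEDURE kill_long_running_queries()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE qid BIGINT;
  DECLARE qinfo TEXT;
  -- Hypothetical filter: user queries running longer than 300s.
  DECLARE cur CURSOR FOR
    SELECT id, LEFT(info, 1000)
      FROM information_schema.processlist
     WHERE command = 'Query'
       AND user NOT IN ('root', 'repl', 'system user', 'wikiadmin')
       AND time > 300;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  kill_loop: LOOP
    FETCH cur INTO qid, qinfo;
    IF done THEN LEAVE kill_loop; END IF;
    -- event_log columns are assumed; only "stamp" is visible in the log above.
    INSERT INTO event_log (stamp, info) VALUES (NOW(), qinfo);
    -- KILL cannot take a variable directly, so go through a prepared statement.
    SET @k := CONCAT('KILL QUERY ', qid);
    PREPARE stmt FROM @k; EXECUTE stmt; DEALLOCATE PREPARE stmt;
  END LOOP;
  CLOSE cur;
END//
CREATE EVENT query_killer
  ON SCHEDULE EVERY 60 SECOND
  DO CALL kill_long_running_queries()//
DELIMITER ;
SQL
# The event scheduler must be enabled for the event to fire:
#   SET GLOBAL event_scheduler = ON;
```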
[11:53:32] hey :) [11:53:50] I've just encountered on cp1066 a situation similar to what marostegui described in https://phabricator.wikimedia.org/T150160#3050681 [11:54:28] apparently a possible way to fix the drac is a mc cold reset [11:54:41] ipmitool -I lanplus -H $MGMT_HOSTNAME -U root mc reset cold [11:55:06] which of course doesn't work because lanplus is broken, but then there's the option of using bmc-device on the host itself [11:55:28] I've tried this on dbstore1001 and it returns some output: [11:55:29] bmc-device --get-sel-time [11:55:30] SEL Time : 04/26/2017 - 12:52:10 [11:55:59] so FYI I think you could try this and see if it helps: bmc-device --debug --cold-reset [11:59:00] that didn't work on cp1066: ipmi_cmd_cold_reset: BMC busy [11:59:27] OTOH --get-sel-time also fails there, so... [11:59:28] ipmi_cmd_get_sel_time: BMC busy [12:00:13] ema: fwiw we have freeipmi on all hosts installed [12:00:26] so tools like ipmi-sel, ipmi-chassis, etc... [12:00:49] not sure if they can help [12:00:58] ema: oh thanks, I will try that :) [12:02:24] ema: so it only restartrs the drac right? [12:03:24] marostegui: I think so, yes. Maybe try that on a host you don't really care about first :) [12:04:30] yes XDD [12:04:33] that was the idea XD [12:07:05] ema: DB that you don't really care about... 404 Not Found :-P [12:07:31] volans: hehe [12:34:39] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#3213601 (10Marostegui) s5 is now done. [12:45:53] 10DBA, 10MediaWiki-JobRunner, 07Wikimedia-log-errors: Job runners throw lots of "Can't connect to MySQL server" exceptions - https://phabricator.wikimedia.org/T121623#3213633 (10jcrespo) 05Open>03declined Not happening in a long time- but nothing was technically done. Resolved or not needed anymore. [12:59:14] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3213671 (10Marostegui) db1092 is finally down too. [13:12:54] 10DBA, 06Operations, 10ops-eqiad: Move masters away from D1 in eqiad? - https://phabricator.wikimedia.org/T163895#3213695 (10Marostegui) [13:44:53] 10DBA, 06Operations, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3213837 (10Reedy) 05Open>03Resolved a:03Reedy ``` mysql:wikiadmin@db1084 [commonswiki]> select * from page where page_title = ''; Empty set (27.45... [13:46:02] to make dbstore1001 happier [13:46:15] now all alter tables are being executed one day later at the same time [13:46:27] yeah :( [14:00:40] marostegui: About? [14:00:52] For https://phabricator.wikimedia.org/T130067 is that s3 is done in eqiad, not in codfw? 
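For reference, the DRAC/BMC reset sequence ema describes above (11:53–12:07), collected in one place. The remote form needs a working lanplus interface; when that is broken, freeipmi's bmc-device can attempt the same cold reset in-band from the host itself (as was later tried on db1070, without success there):

```bash
# Remote: cold-reset the management controller over the network (needs working lanplus).
ipmitool -I lanplus -H "$MGMT_HOSTNAME" -U root mc reset cold

# Local, via freeipmi on the affected host itself:
bmc-device --get-sel-time          # quick sanity check that the BMC answers at all
bmc-device --debug --cold-reset    # attempt the cold reset in-band
```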
[14:01:03] Reedy: yep [14:01:11] cheers [14:01:27] Reedy: so we cannot close this ticket, as it is not yet finished across the board [14:01:31] Sure [14:01:38] once codfw is sby, we will reimport that table there [14:01:40] I was about to say it's not there on codfw [14:01:46] but that explains it :) [14:01:57] Was trying to run a maintenance script that is requesting it [14:02:17] we should consider that column not deployed really, it cannot be used yet [14:03:04] I don't think it's used in code [14:03:10] Just one slightly eager maintenance script [14:03:13] But there's no real rush to run it [14:03:16] haha [14:03:25] We will get it done hopefully right after the switch back [14:05:38] looks like the script does things fairly generically [14:06:12] you want me to reply on that task you talk about the script failing? [14:08:34] Reedy> I already replied [14:08:47] I'd already edited my comment :) [14:08:51] ah :) [14:08:55] no changes should be done using that even after the failback [14:09:03] the reimport has to be done first [14:09:13] so consider wl_id nonexistant for now [14:09:32] the ticket has to be fully resolved first [14:09:34] yeah [14:09:41] what I can do [14:09:52] which is something I been thinking for some time [14:10:02] is to create a mediawiki core branch [14:10:05] with the production state [14:10:11] only for tables.sql [14:10:20] so it is 100% clear the production state [14:10:35] wl_id literaly took 1 year to deploy [14:10:48] and now we are finally going from being behind head [14:11:03] to literally being ahead of HEAD [14:11:07] Like I say, no one is actually using it in production code, other than a maintenance script [14:11:20] yes, no problem [14:11:23] not too worried [14:11:27] Which was only CR+2'd yesterday [14:11:31] but it causes small issues [14:11:36] And I was bored, so thought I'd run it, and saw it fail [14:11:40] ah! [14:11:47] Posted to the bug, then started digging [14:11:49] so YOU ran it :-) [14:11:54] lol [14:12:00] we want your head! [14:12:03] :-) [14:12:27] what do you think about a table.sql to know the official production state? [14:12:31] whould that be useful? [14:12:40] or is #blocked-on-schema-change enough? [14:13:17] I think the latter is fine [14:13:34] I guess no one associated that script with the schema change to be done [14:13:34] if you get really, really bored [14:13:50] I have a task for you [14:14:03] similar to the reviewing tables for drop [14:14:04] well.... If I can do it from Zambia... ;D [14:14:13] I noticed a few in that list that really should go again [14:14:20] the logging pre 1_10 or whatever [14:14:21] hey, you have avsolutely [14:14:32] 0 need to do it [14:14:38] I am just suggesting [14:14:52] there are lots and lots of changes on schema-change [14:15:02] My internet connection is intermittant, but I've got time to do stuff [14:15:11] most are discussions that didn't reach to anywhere [14:15:31] but 1% was merged and didn't told us [14:15:51] I need to go over that list and just mark as #blocked-on-schema-change [14:16:06] the ones that actually went through [14:17:01] I started but not all are 100% clear [14:17:08] link to the list? 
[14:17:34] https://phabricator.wikimedia.org/project/view/161/ [14:17:58] https://phabricator.wikimedia.org/T51188 [14:18:17] there is only 15 on the last one [14:18:44] the idea is to delete that tracking and be left with the tags: schema-change for ongoing discussions [14:19:07] #blocked-on-schema-change for changes that are on HEAD but have not been applied [14:25:38] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448#3213983 (10Reedy) [14:25:38] * Reedy starts tagging a few [14:25:51] thank you! [14:27:35] 07Blocked-on-schema-change, 10MediaWiki-Database, 07Schema-change: Add index type_action - https://phabricator.wikimedia.org/T51199#3213985 (10Reedy) [14:27:42] marostegui, what do you think about https://phabricator.wikimedia.org/T17441#3210840 [14:28:34] do you think it is ok to start a schema change on templatelinks and pagelinks for the large wikis now? [14:28:53] I would wait for the network maintenance to be finished [14:28:56] But other than that [14:28:58] I would go for it [14:29:03] yeah [14:29:06] not worried about that [14:29:07] as we said, we can always change the PK if needed, because at least we would have one XD [14:29:10] but about potentially [14:29:21] taking days to finish [14:29:35] 10DBA, 07Schema-change: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674#3213989 (10Reedy) 05Open>03stalled [14:29:50] I am fine with that, we have a week for them to finish [14:30:00] maybe start with the smaller ones? [14:30:06] so we can advance and have more and more PKs? [14:30:06] actually [14:30:17] I would have thought about that [14:30:21] but those are non-issues [14:30:25] 10DBA, 07Schema-change, 07Tracking: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188#948551 (10Reedy) [14:30:37] jynus: they are also a non blocking operation, so there will be no lag [14:30:37] most could be done in semi-hot way [14:30:50] I was planning to do them on the master [14:31:03] still, there should be no lag, no? [14:31:15] no lag on the master, but yes on the slaves [14:31:26] ah, when it gets replicated yes yes [14:31:27] unless we use the non-working [14:31:33] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342#3213993 (10Reedy) [14:31:36] replication channels [14:31:44] too risky i would say [14:31:50] that is why I wanted to discuss it with you :-) [14:31:58] Monday is off [14:32:26] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339#3214007 (10Reedy) [14:32:45] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338#3214011 (10Reedy) [14:32:54] we can start with small wikis, i prefer to have inconsistencies but shards done that no shards done to be honest :( [14:32:57] no? 
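The change being weighed here (and in the lines that follow) is promoting an existing unique index to a primary key on tables like templatelinks and pagelinks. With InnoDB this can be done in place without blocking writes on the host where it runs; lag then only appears on replicas as the statement replicates through. A hedged sketch — the host is a placeholder, and the index/column names are assumed from the MediaWiki schema of the time rather than taken from the actual alter:

```bash
# Illustration only; the real alters for these tickets may differ.
# ALGORITHM=INPLACE, LOCK=NONE makes the server refuse the change (instead of
# silently blocking writes) if it cannot be performed online.
MASTER=db10xx   # hypothetical placeholder for the section master
mysql --skip-ssl -h "$MASTER" enwiki -e "
  ALTER TABLE templatelinks
    DROP INDEX tl_from,
    ADD PRIMARY KEY (tl_from, tl_namespace, tl_title),
    ALGORITHM=INPLACE, LOCK=NONE;"
```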
[14:33:11] my brain agrees with that [14:33:22] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757#3214012 (10Reedy) [14:33:24] my heart says we do enwiki and some of the large ones first [14:33:58] then there is oldimage [14:34:08] which has no appropiate keys [14:34:19] but it is scheduled for reformat [14:34:23] so templatelinks and pagelinks are 200G more or less and in a 64G slave it took 24 hours to alter a slightly similar table [14:34:48] that is about right [14:34:53] it is what it took last time [14:34:56] so there will be lag but the fast slaves and the 160G should be fine [14:35:01] in aronud 12-15 hours [14:35:10] not sure 100% it can be done online [14:35:20] it is a primary key change? [14:35:24] but a weird one [14:35:29] unique -> PK can be done online [14:35:36] ok [14:35:37] we have done it for revision table [14:35:38] 10DBA, 07Schema-change, 07Tracking: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188#3214044 (10Reedy) [14:35:40] 10DBA, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Schema-change: Drop fj_new_sha1 field - https://phabricator.wikimedia.org/T51195#3214041 (10Reedy) 05Open>03stalled [14:35:45] ah, nice [14:35:47] mmm [14:35:49] wait a sec [14:35:56] did we? [14:36:15] but it is more to the point [14:36:27] if it is not online, it is better to "rush" it now [14:36:35] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593#3214059 (10Reedy) [14:36:43] no, the revision table did have a PK indeed [14:36:51] but for the hosts I tested it, for those tabls, there was no lag [14:36:53] Look manuel at all that new work we just got! [14:36:59] isn't that nice? [14:37:13] Anyways, I think we should go ahead, at this point if it generate lags we "don't care" [14:37:16] haha yes [14:37:22] I am seeing all my inbox getting full of reedy reedy reedy [14:37:38] I am glad we have some work today, we were running out of tickets [14:37:39] sigh [14:37:53] Although I have to say, dropping tables and stuff like that, I LOVE [14:38:03] many of those are low [14:38:16] mostly dropping columns or changing indexes [14:38:34] there's many more tables that want to just diaf [14:38:39] because backups are hard [14:39:10] yeah [14:39:20] backups are not the problem [14:39:26] 07Blocked-on-schema-change, 06Community-Tech, 10MediaWiki-Database, 07Hindi-Sites, and 4 others: Allow comments longer than 255 bytes - https://phabricator.wikimedia.org/T6715#1197408 (10Reedy) [14:39:29] is the undocumented used tables or old code using index [14:39:44] it is east to create, not as easy to destroy :-) [14:39:59] and it is no ones's fault, it is just the way it is [14:40:10] well, things like logging_pre_1_10 and shit are backups :P [14:41:13] I think what we need is 1 backend and 1 frontend developer to do the tracking job for us: https://phabricator.wikimedia.org/T104459 [14:41:21] I think that's all the open subtasks of T51188 triaged... [14:41:22] T51188: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188 [14:41:40] Reedy, thank you a lot [14:41:41] Some moved to stalled, others tagged where there's stuff that needs doing at some indeterminate point in the future [14:41:47] yeah [14:41:51] there are many of those [14:42:06] Reedy thanks a lot!! [14:42:06] "it would be nice if..." 
but not done [14:42:16] yeah, indeed [14:42:30] that is why I ended up separating schema-change from the blockign [14:42:43] also because I had no time to look at 10 years of backlog [14:43:09] we will get better, I think [14:45:00] 10DBA, 07Schema-change, 07Tracking: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188#3214070 (10jcrespo) I think Reedy has triage all of this tickets, so this tracking task can be archived (deprecated by #blocked-on-schema-change) [14:46:53] it always happens- that task is not declined nor invalid nor resolved- :-/ [14:47:21] Declined, I think? [14:48:10] 07Blocked-on-schema-change, 07Schema-change: Make user_newtalk.user_id unsigned - https://phabricator.wikimedia.org/T163911#3214088 (10Reedy) [14:48:27] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3214088 (10Reedy) [14:48:49] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Make user_newtalk.user_id unsigned in WMF production - https://phabricator.wikimedia.org/T163911#3214088 (10Reedy) p:05Triage>03Low [14:49:01] 10DBA, 07Schema-change, 07Tracking: Schema changes for Wikimedia wikis (tracking) - https://phabricator.wikimedia.org/T51188#3214106 (10jcrespo) 05Open>03Invalid Use the new workflow instead https://wikitech.wikimedia.org/wiki/Schema_changes from now on to tag in-discussion and/or pending to be applied t... [14:49:03] Yeah, decline it [14:49:06] or Invalid [14:49:08] really doesn't matter [14:49:16] But no doubt someone will disagree and change it :P [14:49:18] tehcnically it doesnt [14:49:27] I wonder if I should just make a patch to jfdi all of T157227 in one go [14:49:28] but people sometimes can get sensitive [14:49:29] T157227: Mediawiki tables with columns which references other columns but have different type (tracking) - https://phabricator.wikimedia.org/T157227 [14:49:53] I filed that one [14:50:17] while doing https://gerrit.wikimedia.org/r/337390 [14:52:35] ema: tried the bmc-device --debug --cold-reset with no luck :_( [14:52:44] Reedy, T6715 is not blocked on schema change [14:52:45] T6715: Allow comments longer than 255 bytes - https://phabricator.wikimedia.org/T6715 [14:53:00] it is heavily in discussion [14:53:01] marostegui: well it was worth a shot [14:53:04] rev_comment varbinary(767) NOT NULL, [14:53:05] yep [14:53:06] :) [14:53:12] master has 767 not 255... [14:53:26] or are we changing it further? [14:53:33] yea, but it is one of those change -done and we disagree [14:53:52] it it technically blocked on schema change [14:54:09] but there is ongoing discussion if it shoudl be deployed at all [14:54:14] hmm [14:54:15] fair enough [14:54:15] like that other one with the index [14:54:34] Why is T146570 under T157227? [14:54:35] T146570: Give user_properties a primary key - https://phabricator.wikimedia.org/T146570 [14:54:35] T157227: Mediawiki tables with columns which references other columns but have different type (tracking) - https://phabricator.wikimedia.org/T157227 [14:55:19] ha "is a subtask of" [14:55:28] vs "it is blocked by" [14:56:16] not very clear if you ask me [14:56:16] 10DBA, 06Operations, 10ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3214117 (10Marostegui) [14:56:20] 10DBA, 06Operations, 10ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097215 (10Marostegui) Leaving this documented for the future. I tried a cold reset locally, but it doesn't fix the remote issue. 
``` root@db1070:~# bmc-device --debug --cold-reset =============================... [14:56:39] marostegui, I am going to mass run the alter on s3 [14:56:46] and see tomorrow what happens [14:56:58] jynus: on all the wikis or etwiki? [14:57:08] I already did etwiki yesterday [14:57:13] ah true, sorry [14:57:17] today all s3 wikis [14:57:21] \o/ [14:57:27] so I have a way out [14:57:32] if we are not ok [14:57:48] and tomorrow we have a go/no go decision [14:57:54] with less things going on [14:58:03] (re: network maintenance) [14:58:12] me if we get all those PK in place before the switchover -> https://img.memesuper.com/100c5d677cc28ea3f154c70d641f655b_meme-crying-gif-crying-gif-meme_620-340.gif [14:58:37] oldimage on commons :-/ [14:58:57] yeah :( [14:59:22] but I see it as, the more PK we have, the better, even if we don't manage to get them all [14:59:54] we will need them for when we migrate to innodb cluster [15:00:01] xdddddddd [15:09:53] Reedy, I think it is the right way: https://phabricator.wikimedia.org/T6715#3214151 [15:10:34] it is merged, but discussion is ongoing if it should be really deployed [15:10:34] WFM [15:10:55] makes it more clear [15:10:59] that way expectations are clear [15:11:09] and reflects reality [15:11:24] marostegui, I have changed that board [15:11:34] feel free to disagree O:-) [15:11:48] which board? [15:11:55] blocked-on-schema-change [15:11:59] ah [15:12:00] let me see [15:12:11] just added a new column [15:12:28] I am starting to miss a "in progress" column there [15:12:31] as we have in the DBA board [15:12:36] aka "What is this shit" [15:12:36] we can add one [15:12:45] Yeah, in progress would be good [15:12:54] but I didn't want to duplicate DBA one [15:12:54] to know what has been started etc [15:13:07] I am ok if we can keep it up to date [15:13:14] I am not sure we will [15:13:24] I have been trying hard with the DBA one [15:13:58] I am probably not talking about you :-) [15:14:02] haha [15:14:19] Not everything in blocked-on-schema-change is tagged DBA [15:14:26] it should be [15:14:39] the wikidata one is missing :P [15:14:50] fixed [15:14:52] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3214165 (10Reedy) [15:14:54] maybe it is on the parent one [15:15:04] " It should not be used if a DBA was not CCd or the DBA tag was not applied to a parent ticket" [15:15:13] sometimes the parent ticket is the same [15:15:26] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3166532 (10Reedy) [15:16:04] So I am ok with adding a new column, if you want to maintain it [15:16:31] I think we can try and see if it works good [15:17:26] I just created it [15:20:15] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#3214211 (10Reedy) [15:22:40] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3214233 (10jcrespo) [15:35:24] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214400 (10Papaul) Hi Papaul, Thank you for contacting Dell EMC Basic Server Support. 
This mail is with reference to the (Memory and CPU Issue) you had reported on your PowerEdge(R730XD). Please find... [15:39:36] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214415 (10Papaul) Hi Papaul, I will get the motherboard and the memory module replaced at the same time but at the same time would like to request you to help me with the address of the location where you... [15:40:44] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3214416 (10Marostegui) Thanks Papaul! As per our chat, I have brought MySQL, ping me when you need it down again, [15:41:18] 10DBA, 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3214419 (10matmarex) \o/ [15:58:29] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3214458 (10matmarex) [16:04:10] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3214508 (10Marostegui) For the record: the alter on labsdb1001 has been killed because we ran out of space there, same t... [16:08:30] lsof +D /srvuserdata shows no usage of files there [16:08:38] we can back them up [16:08:43] delete the partition [16:09:01] add them to tank-data [16:09:10] and get 3 TB for free [16:10:26] there is only u3532__ -> /srvuserdata/u3532__ [16:11:38] yeah and that s51187__xtools_tmp-20160128 [16:11:46] which looks old [16:13:19] [root@labsdb1001 16:13 /srvuserdata] [16:13:19] # find -type f -mtime -90 [16:13:23] we could just delete the partition [16:13:46] yeah [16:13:52] or [16:13:52] looks like nothing is being used [16:14:03] move more user dirs there [16:14:11] ones that are not innodb [16:14:18] and do a symlink? [16:14:39] I suppose that was the intended usage for /srvuserdata? [16:14:49] probably XD [16:15:00] we can shrink the partition [16:15:03] but if the partition isn't being used, it might be just easier to merge and get all the space for us [16:15:05] that would be safer [16:17:35] so in the last 120 days only s51187__xtools_tmp was modified [16:17:41] a few tables there [16:18:11] yeah, but that doesn't fit back on /srv, does it? [16:18:30] no, it is 254G [16:18:32] :( [16:18:41] technically it does, we have 370G available [16:18:44] in /srv [16:18:48] but you know what I mean [16:19:24] but s51187__xtools_tmp is not linked, right? [16:19:35] should we maybe compress all that, send it to somewhere and delete the partition? [16:19:38] no [16:19:38] only u3532__ [16:19:41] it is not linked [16:19:48] we move back that [16:19:53] we delete everthing else [16:21:16] actually, it is also on srv [16:21:26] isn't it a symlink? 
[16:21:42] no [16:22:17] [root@labsdb1001 16:21 /srv/sqldata] [16:22:17] # ls -lh | grep s51187__xtools* [16:22:17] drwx------ 2 mysql mysql 736K Apr 26 16:21 s51187__xtools_tmp [16:22:19] whereas [16:22:20] u3532__ -> /srvuserdata/u3532__ [16:22:26] [root@labsdb1001 16:22 /srv/sqldata] [16:22:26] # ls -lh u3532__* [16:22:26] lrwxrwxrwx 1 root root 20 Apr 8 10:44 u3532__ -> /srvuserdata/u3532__ [16:22:27] no, I meant that ^ [16:22:50] ah, no, i was talking about: s51187__xtools_tmp-20160128 [16:23:04] old backup, we purge it [16:23:04] which is in srvuserdata eating 254G [16:23:19] I say [16:23:37] let's do it [16:23:43] if it fails, it is all my fault [16:23:49] it is OUR foult [16:23:55] I would say we stop mysql [16:24:02] just in case [16:24:08] no need [16:24:18] is it all myisam? [16:24:20] I copied it live last time [16:24:21] yes [16:24:22] ah ok [16:24:34] I ionjly did flush tables with read lock [16:24:40] for that dir [16:24:47] if it worked, then fine [16:24:48] :) [16:29:46] doing now [16:29:58] ok [16:31:21] copying https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=labsdb1001&var-network=eth0&from=now-1h&to=now [16:32:22] I cannot believe we'll have almost 3TB back! [16:32:27] once we mergfe the partitions [16:33:08] looks cool and permissions looks cool too [16:33:40] i can select [16:33:43] from those tables [16:34:07] so I am going to drop the LVM group [16:34:15] well, umount first, I assume [16:34:18] yeah [16:34:29] if you want to paste the commands here, I can double check them too [16:34:40] well, I can handle the umount [16:34:47] yeah, i meant the lvm ones [16:34:48] XD [16:35:03] device busy [16:35:07] maybe you? [16:35:09] nope [16:35:11] i am not there [16:35:29] I only have one session [16:35:32] and I am on / [16:35:35] caould be me [16:35:37] I see three users from you connected [16:36:21] it is a bash [16:36:34] and a screen [16:36:54] let me check [16:37:09] yes [16:37:11] I saw the screen [16:37:13] pretty old screen [16:37:21] srop tables [16:38:05] I just killed it [16:38:21] ok [16:38:26] let's see if it was it [16:38:35] nope [16:38:37] not enough [16:39:22] i have rechecked all the screens [16:39:38] could be just prometheus [16:40:29] let's see if lsof finishes… [16:40:32] nope [16:40:45] yeah, do a full run and see if there is something left [16:41:23] aha [16:41:28] ? [16:42:13] aha as in: yes yes [16:42:18] sorry XD [16:42:20] ah, ok [16:42:26] it is taking ages [16:42:26] I thought you have found something [16:42:29] no no [16:42:35] i am doing a simple lsof | grep [16:42:39] to see if there is soemthing we are missing [16:44:30] i wonder why fuser -m /srvuserdata reports mysql pid [16:44:38] yeah [16:44:58] lsof still running [16:44:58] I do not see references to it [16:45:07] but it could be just the symlink? [16:45:11] I can flush tables [16:45:13] but we removed it [16:45:15] to close open files [16:45:17] try that yes [16:45:47] and now it is gone [16:45:50] fuser reports nothing [16:46:16] umounted [16:46:23] any breakage? [16:46:33] aside from the fuser going wild probably [16:46:38] *flush [16:46:53] nope, so far so good [16:47:12] i am checking random tables [16:47:16] and so far so good [16:47:19] and the log shows nothing [16:47:31] lots of metadata stuff [16:48:04] but back ok, I think [16:48:07] yeah [16:48:08] it is all gone [16:49:34] no raid no nothing [16:49:44] literally lvm on top of physical disks [16:50:39] there is raid [16:51:15] is there? 
[16:51:29] i believe so [16:51:32] Device Present [16:51:33] ================ [16:51:33] Virtual Drives : 1 [16:51:36] I see /dev/sda -> userdata_1001 [16:51:55] /dev/sdh -> tank [16:52:00] etc [16:52:20] ah, the virtual drive is for sda only [16:52:24] so there is raid for "/" [16:52:29] oh nice [16:52:35] so we do not lose the os! [16:52:37] 7s [16:52:39] xdddddd [16:52:42] /s [16:53:12] it is strange [16:53:33] not completely sure about the raid [16:53:43] because it is on sda [16:53:45] i think [16:54:50] the raid size is 3.2TB [16:54:55] vgreduce userdata_1001 /dev/sda [16:55:01] which matches tank-data [16:55:35] then: lvremove userdata_1001 [16:55:53] yes, that sounds good [16:56:18] still in use [16:56:55] fuser reports lots of things now :| [16:57:00] pvremove first? [16:57:07] good or bad? [16:57:25] coming back from downtime i guess [16:57:27] those pages [16:57:31] yeah [16:57:34] maybe we need to disable the vgfirst [16:57:36] vg [17:08:17] have we lost all downtimes? I am seeing that dbstore1001 for example isn't downtimed for the replication threads [17:09:08] maybe some? [17:09:18] the ones I did were for the 28 [17:09:38] but let's be honest, I have lost account with so many changes [17:09:55] yeah XD [17:10:04] but i am pretty sure i downtime stuff for long [17:12:30] oh! [17:12:37] you removed the volume finely! [17:12:56] was it lvremove first then? [17:13:22] vgremove does everything [17:13:25] at the same time [17:13:36] so I see now 3TB available [17:13:54] tank 8 1 0 wz--n- 6.33t 3.28t [17:14:03] (I think) [17:14:23] # pvs [17:14:23] PV VG Fmt Attr PSize PFree [17:14:23] /dev/sda tank lvm2 a-- 3.27t 3.27t [17:14:25] :) [17:14:38] let's do the final test, lvextend -L+100G ? [17:14:40] but I cannot lvextend [17:16:20] what do you get? [17:16:46] Insufficient suitable allocatable extents for logical volume data: 260134 more required [17:17:09] I think it is striped and needs a different method [17:19:52] maybe we need to rescan the vg again? [17:21:09] wait [17:21:18] the group was increased [17:21:25] the other way round [17:21:35] we have to increse the size allocated [17:22:04] Allocation inherit [17:22:18] so we might need the alloc normal data [17:22:21] I believe it was [17:22:23] let me google that up [17:22:45] Allocated PE 0 [17:22:52] yes, it is part of the group [17:22:58] but has no allocated parts [17:23:13] we need to say: hey, you can use this physical disk [17:24:14] https://serverfault.com/questions/829372/insufficient-suitable-allocatable-extents-when-extending-lvm/829389#829389 [17:24:36] yeah, I may have done that by accident [17:25:18] still, Allocated PE 0 [17:25:21] and it didn't work? [17:29:42] ok, I have remove it again [17:29:50] I am reading that some people uses: lvexted -l [17:29:52] i will format it and try again [17:29:53] instead of -L [17:31:29] added again [17:32:55] ok [17:32:57] lets see [17:34:01] I cannot see it [17:34:05] do you want to try? [17:34:37] https://serverfault.com/questions/829372/insufficient-suitable-allocatable-extents-when-extending-lvm/829389#829389 [17:34:42] did yiu try that one then? [17:34:50] just to discard it [17:34:58] --alloc normal, I will try [17:35:27] Volume group allocation policy is already normal [17:36:04] :( [17:37:43] --alloc anywhere ? 
[17:37:50] i was reading about it [17:38:07] we can try [17:38:24] If there are sufficient free Physical Extents to satisfy an allocation request but normal doesn't use them, anywhere will - even if that reduces performance by placing two stripes on the same Physical Volume. [17:38:36] or we can use 2 partitions on the same disk [17:39:02] yeah, it works now [17:39:09] I think it was striped [17:39:11] what did you do? [17:39:17] and we just add a new disk [17:39:24] while we needed 2 [17:40:17] yaaaayç [17:40:20] I see it now [17:40:25] so what was the key command? [17:40:38] --allow anywhere [17:40:43] you were 99% right [17:40:51] I just did the rest 1% [17:40:58] and took the credit [17:41:13] as usual :-) [17:41:18] that is not true! [17:41:22] i didn't type a single command! [17:41:25] can I run the alter now? :p [17:41:30] look, this is not safe in any case [17:41:36] so this is more than acceptable [17:41:45] (the disk configuration, I mean) [17:41:48] yeah [17:41:56] but hey, this is an emergency! [17:42:07] and we should do the same on labsdb1003 maybe (not today!) [17:42:26] there is still 2.27 T free [17:42:53] on 1003? [17:42:57] I only see 500G [17:43:13] "what do we say to the god of breaking production? - not today" [17:43:24] xddddd [17:43:26] but you can run the alter [17:43:30] and break it :p [17:43:33] ok, I will do it now [17:43:33] that we can do today [17:43:40] it will be a good test for the new partitions [17:44:18] 1003 has 500 GB free [17:44:25] so that should sustain for longer [17:44:41] I will run the s3 alter [17:44:47] Great! [17:44:49] Alter running [17:44:49] but I will check the downtimes first [17:44:50] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3214980 (10Marostegui) >>! In T162539#3214508, @Marostegui wrote: > For the record: the alter on labsdb1001 has been kil... [17:44:57] I am going to logoff for now I think [17:45:02] yeah [17:45:03] thanks [17:45:14] the network maintenance is done [17:45:20] nice [17:45:21] I have pnged arzhel to just confirm it [17:45:26] but all the hosts are up and replicating [17:45:54] see you tomorrow! thanks for all the help! [17:45:59] es1019 booted badly [17:46:02] I will check it [17:46:05] bye! [17:46:07] badly? [17:46:11] go away [17:46:21] nothing worrying [17:46:35] ok, let me know tomorrow what was it [17:46:38] thanks for keeping an eye [17:46:41] bye byeee [17:50:11] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3214992 (10ayounsi) Rack move to D7 and D8 are done. Switch ports configuration for row D is done. Remaining is to move servers' uplinks from asw to asw2 in the other D racks. [18:35:03] 10DBA, 10AbuseFilter, 06Performance-Team, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 13Patch-For-Review: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3215124 (10Krinkle) The fix was merged and is going out in this weeks branch. Should start applyi... [19:04:42] 07Blocked-on-schema-change, 10DBA: Convert unique keys into primary keys for some wiki tables on s3 - https://phabricator.wikimedia.org/T163912#3215281 (10jcrespo) ongoing ```lines=10 root@db1075:~$ cat s3.dblist | grep -v -e '^etwiki$' | grep -v -e '^aawiki$' | while read db; do echo "Altering $db..."; my $d... 
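A consolidated sketch of the volume-group shuffle worked out above on labsdb1001 (roughly 16:08–17:41): free the unused /srvuserdata volume group, hand its physical volume over to the tank group, and grow the data LV. The filesystem-grow step, mount point and size are assumptions; and as noted in the log, --alloc anywhere was only needed because tank/data is striped, at some cost in performance when both stripes land on one device:

```bash
# Sketch of the steps discussed above; verify nothing still holds files open first.
lsof +D /srvuserdata                      # should return nothing
mysql -e "FLUSH TABLES"                   # close handles left via the old symlinked dirs
umount /srvuserdata

vgremove userdata_1001                    # drops the old VG and its LVs in one go (prompts)
vgextend tank /dev/sda                    # give the freed physical volume to the tank VG

# --alloc anywhere is what finally worked here, because a single extra PV cannot
# satisfy the normal (striped) allocation policy of the existing LV.
lvextend --alloc anywhere -L +100G /dev/tank/data

# Grow the filesystem afterwards (xfs assumed, mounted at /srv; use resize2fs for ext4):
xfs_growfs /srv
```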
[19:05:39] 10DBA, 13Patch-For-Review: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3215282 (10Marostegui) Is that the operation that implies some small connectivity loss?