[01:04:51] 10DBA, 10MediaWiki-API, 07Performance: Slow query in API list=tags - https://phabricator.wikimedia.org/T164552#3237462 (10Catrope)
[01:23:49] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (10Catrope) >>! In T164530#3236896, @jcrespo wrote: > Thanks. One thing that I have been suggesting lately...
[01:28:53] 10DBA, 10MediaWiki-API, 10MediaWiki-Change-tagging, 13Patch-For-Review, 07Performance: Slow query in API list=tags - https://phabricator.wikimedia.org/T164552#3237504 (10TTO)
[05:43:47] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3237677 (10Marostegui) db2059 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2059.codfw.wmnet wikidatawiki -...
[05:44:45] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3237678 (10Marostegui) db2059 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2059.codfw.wmnet...
[07:23:49] 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3237753 (10Marostegui) s3 is going to be interesting to run the pt-table-checksum on...as not all the tables exist on all the wikis as per my last checks some months ago. I would start by doing a list of the most important tables we w...
[07:33:53] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3237779 (10Marostegui) a:03Marostegui
[08:23:34] I have started to load db1015 again
[08:23:41] minus cebwiki
[08:23:45] good!
[08:23:52] we will see how long it takes
[08:24:15] I can also start testing more on dbstore2001
[08:24:42] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1015&var-network=eth0
[08:25:34] Are you loading in parallel then? With how many threads if yes?
[08:25:55] 8 threads
[08:26:09] nice
[08:26:17] so 8 databases in parallel?
[08:26:25] no
[08:26:37] it can do intra-db and intra-table parallelism, too
[08:26:44] nice!
[08:27:13] I have disabled all alerts on db1015
[08:27:23] in case something goes wrong again
[08:27:38] you think it will finish today?
[08:27:41] yes
[08:27:52] cebwiki is a very large db
[08:28:00] and should be migrated away from s3
[08:28:05] is that the one with a huge templatelinks table?
[08:28:10] yes
[08:28:23] * marostegui worried that he knows that from memory
[08:28:23] migrating away is the first step
[08:28:36] the real way to fix this is to avoid:
[08:36:56] | 2617034 | 10 | ; | 0 |
[08:37:05] repeated 100M
[08:37:08] https://ceb.wikipedia.org/w/index.php?title=Espesyal:WhatLinksHere/Plantilya:;&hidelinks=1&hideredirs=1
[08:37:36] 10DBA, 06Operations: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3237844 (10Marostegui) You think db1024 can go away now? It was the old old master (it is depooled) and as per T154485#3171631 we should be good to go.
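For context, the reload discussed at [08:23:34]-[08:26:37] maps roughly onto a myloader invocation like the sketch below. Only the 8-thread figure comes from the log; the dump directory, connection details and remaining options are illustrative assumptions, not the command actually used.

```
# Hypothetical myloader run for a parallel logical reload of db1015, assuming a
# mydumper export directory under /srv/tmp (path is illustrative). Because large
# tables are dumped as several chunk files, the loader threads can also work on
# chunks of the same table concurrently - the intra-db and intra-table
# parallelism mentioned in the conversation.
myloader \
    --directory=/srv/tmp/export-s3 \
    --threads=8 \
    --queries-per-transaction=1000 \
    --overwrite-tables \
    --host=localhost \
    --user=root
# Credentials are omitted on purpose; supply them however is appropriate for
# the environment. This is a sketch, not the production procedure.
```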
[08:38:24] 10DBA, 06Operations: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3237847 (10jcrespo) Yes
[08:39:35] 10DBA, 06Operations: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3237850 (10Marostegui) >>! In T162699#3237847, @jcrespo wrote: > Yes Great, will prepare the patches and merge them next week and let Chris know so he can do his part too.
[08:48:22] oh, you are running alter table there
[08:48:24] I will wait
[08:48:51] Yes, in 2001 and 1001
[08:48:53] 2002 is free
[08:49:05] yeah, but I want a place with plenty of space
[08:49:12] Ah
[08:49:14] I will wait
[08:49:16] np
[08:49:41] mydumper detected the long-running query and warned
[08:51:18] Sorry then, it will take almost the whole day
[08:52:06] no sorry!
[08:52:30] as if I didn't have plenty of things pending to do!
[08:53:47] haha
[08:53:48] yeah
[09:02:51] or
[09:03:07] I can try remote dumping from an inactive codfw server
[09:03:36] for example db2071
[09:10:16] Sure, not touching that one
[09:16:24] 10DBA, 06Operations, 13Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3237920 (10Marostegui) All the patches to decommission db1024 are ready now. I will merge next week and create an specific task for Chris for that host so he d...
[09:54:31] is there any database pending to clone?
[09:55:06] pending to clone for what?
[09:55:16] like pending to fix
[09:55:20] like broken that needs cloning?
[09:55:20] ah
[09:55:24] or to install for the first time
[09:55:36] mmm, no, because db1022 was the only one, and we decided not to care about it
[09:55:41] I saw a new method of cloning that could potentially be faster
[09:55:52] and supports multi-hosts
[09:56:06] To install for the first time, maybe you can take one of the new ones and check the future db-eqiad.php and place it somewhere where it will be needed
[09:56:32] maybe I can clone 2 of those, even if later we discard them
[09:56:41] sure
[09:56:58] But in order not to waste your time, if there is a clear place where it will go in the future, just do it I would say
[09:57:06] sure
[09:57:17] I mean, of course, that is why I asked
[09:57:21] yeah yeah
[09:57:37] I haven't reviewed that file in a while, but maybe there are some obvious cases
[09:57:41] so you can install those already
[10:00:52] I think aside from other changes, most of the new hosts will be the multi-instance ones
[10:01:21] I know, I can clone s5 to future s8
[10:01:28] :)
[10:01:35] and leave it for now as s5 with everything
[10:03:27] oh, lots of servers not booting
[10:03:36] yeah
[10:03:37] :(
[10:04:04] this is getting complex - with current state, desired state and everything in-between
[10:05:37] also, the earlier we use them and crash them the earlier we can complain
[10:05:43] haha
[10:05:44] yes
[10:06:17] I am going to update the table on top of https://phabricator.wikimedia.org/T162233
[10:06:25] with the latest status of each one
[10:06:30] that must be updated
[10:06:34] I think I did it
[10:06:41] no, not the racks
[10:06:45] ah
[10:06:51] but the lifecycle
[10:07:15] the ones that are not strikethrough are the ones that are failing
[10:07:36] ok, I didn't get that
[10:07:38] thanks
[10:07:44] maybe it is not clear
[10:07:56] and I should specify it on the original table?
[10:07:57] I will make it explicit
[10:08:02] don't worry
[10:08:05] ok thanks!
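A remote dump from an idle codfw replica, as floated at [09:02:51]-[09:03:36], could look roughly like the sketch below. db2071 is named in the log; the thread count, chunk size, output path and credentials handling are assumptions rather than the command that was run.

```
# Sketch of a remote mydumper run against an inactive codfw replica (db2071 is
# mentioned in the log; every option value here is illustrative). Dumping from
# another host keeps the dump I/O away from servers busy running ALTER TABLEs.
mydumper \
    --host=db2071.codfw.wmnet \
    --user=dump \
    --threads=8 \
    --compress \
    --rows=20000000 \
    --outputdir=/srv/tmp/export-$(date +%Y%m%d-%H%M%S)
# Credentials are omitted on purpose; pass them however is appropriate for the
# environment. --rows splits large tables into chunk files that can later be
# loaded in parallel.
```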
[10:08:16] sorry for the confusion :)
[10:16:41] 10DBA, 06Operations, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3238142 (10jcrespo)
[10:21:10] I am going to try to provision db1099 and db1103 at the same time from db1049
[10:22:42] is es2019 still in "probation"? all notifications are disabled in icinga
[10:22:53] yes
[10:22:59] it is not pooled
[10:23:06] and I have to run a script to check it still
[10:23:43] how much time have we spent on that guy all together? :D
[10:23:45] Yeah, we left it there
[10:23:49] ETOOMUCH IMHO
[10:24:26] volans, do you want to be responsible for losing wikipedia content?
[10:24:35] I think worth it every time
[10:24:53] in money and all that? clearly not
[10:25:02] jynus: it would be nice to run your compare.py there, to see if it behaves well with load
[10:25:06] or it crashes again
[10:25:07] buying a new server would have cost less
[10:25:19] marostegui, that's the plan
[10:25:29] with all the other pending tasks
[10:25:32] :-)
[10:25:52] I was referring to getting a replacement one from the manufacturer, given that it crashes all the time
[10:26:01] not that the time spent was not worth it
[10:26:02] ;)
[10:26:21] like it would have been so much better if we were able to get it replaced after the 2nd or 3rd crash
[10:26:30] 10DBA, 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3238169 (10jcrespo) 05Resolved>03Open Pending to run compare.py around the ids obtained before.
[10:26:32] yeah, and not get it wiped too
[10:26:33] :p
[10:26:50] volans I would love to have had that
[10:26:54] but don't tell me
[10:27:01] tell papaul or robh
[10:27:18] and they will explain to you why they didn't/couldn't do that
[10:27:28] i wonder what is needed to get a full replacement, I guess you really need to probe it
[10:27:38] otherwise change main board and good luck
[10:27:42] but it is the same we had with db2034
[10:27:51] it would have been way easier to get it replaced
[10:27:53] that one was fun
[10:29:29] 53m28.379s to backup enwiki
[10:29:35] nice!
[10:29:44] despite the alter
[10:29:53] although I suspect a remote copy may be faster
[10:30:04] 93G
[10:31:05] enwiki.revision.00000.sql.gz - enwiki.revision.00006.sql.gz + enwiki.revision-schema.sql.gz
[10:32:24] so, one file with the table structure and then 6 files with the data itself?
[10:32:30] yes
[10:32:33] for large tables
[10:32:33] very nice
[10:32:43] those can be loaded in parallel
[10:32:55] for the same table? very nice
[10:33:13] basically, pagelinks, revision, text and templatelinks
[10:34:13] we may want to even separate backup place from lagged slave
[10:34:25] in the future
[10:35:13] you can see the structure at dbstore2001:/srv/tmp/export-20170505-090626 if curious
[10:35:32] including a metadata file
[10:36:55] So you define at which point you want to chunk the data? like: tables bigger than 100M rows and then it automatically does that?
[10:37:05] either that or by size
[10:37:12] Or can you also say: for this table I want 10 chunks
[10:37:13] I thought by row was better
[10:37:23] thinking on parallel load
[10:37:28] even if they are thinner
[10:38:37] the slices are not perfect because it uses heuristics to be able to do parallelism
[10:39:49] with this we can have more flexible backups in 8 hours
[10:39:58] yeah, it is a nice discovery
[10:40:37] well, I was going to reimplement it because I thought this older version wouldn't work for us
[10:40:53] the version in stretch works better
[10:40:58] regarding locks
[10:41:25] also no gtid registration, only binlog position
[10:42:05] my aim with this is to create a cron + script on dbstore2001, test it
[10:42:21] and once we are happy, run it on dbstore1001, too
[10:44:26] 10DBA, 06Operations: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3238218 (10jcrespo) a:05Marostegui>03jcrespo
[10:50:57] 10DBA, 06Operations: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3238231 (10jcrespo) I am recovering db1015 again minus cebwiki. Better that leaving db1015 broken and doing nothing. I have disabled notifications on db1015 just in case. Crea...
[10:52:02] I think I can create backups on /srv/tmp, and once they finish, package them (eg. tar per database on s3) and move them to /srv/backups
[10:55:00] 10DBA, 06Operations: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3238240 (10jcrespo) a:03jcrespo
[10:55:11] Remember dbstore2001 might have old backups on /srv/backups
[10:55:17] I would say delete them anyways
[10:59:01] yeah
[10:59:10] as long as they are localizable
[10:59:18] and more than enough space
[10:59:24] they can be left there
[11:00:02] I will organize things with puppet soon
[11:07:50] jynus: re your comment @ https://phabricator.wikimedia.org/T164407#3237950 it doesn't look like that is cognate related
[11:08:20] I was just pasting anything anomalous I can find there
[11:08:47] the background being that now there is almost no innodb contention on x1
[11:12:05] interesting, and no idea as to the cause?
[11:12:12] nope
[11:12:17] traffic change?
[11:12:32] so the interesting part is how variable stuff is in x1
[11:12:44] that was my main motivation for that comment
[11:14:43] Innodb_mutex_spin_rounds dropped at 1:50 also (whatever that is)
[11:15:06] yeah, just sometimes innodb does active wait
[11:15:26] normal that if blocking goes down, spins go down too
[11:15:44] that is for fine-tuning innodb, and sincerely, we are not yet there :-)
[11:30:13] 10DBA, 06Operations: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3238274 (10jcrespo) Left running on neodymium (I did some optimizations to ignore values below and beyond max id respectively): ``` while read db id; do echo "./compare.py es1019.eqiad.wmnet es2019.codfw.wmnet $db blobs_clust...
[11:30:49] ^the thing with this is less "do you trust gtid" and more "do you trust the hw controller" to do its job
[11:45:13] addshore, your answer on the ticket was exactly what I had in mind
[11:45:50] decrease in query execution time? or?
[11:45:51] something like bad performance created by a separate process creating overhead on cognate and making the write slow
[11:47:10] any ideas on ways to track it down? is there a full list of things using x1 somewhere?
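The packaging idea at [10:52:02] (dump into /srv/tmp, tar per database, move the result to /srv/backups) could be implemented along the lines of the sketch below; the paths, file-name conventions and loop are assumptions for illustration, not the script that was later puppetized.

```
#!/bin/bash
# Hedged sketch of the dump-then-package flow from [10:52:02]. Assumes mydumper
# wrote compressed per-table files named <db>.<table>[.NNNNN].sql.gz plus a
# <db>-schema-create.sql.gz per database into $SRC; everything else here is
# illustrative.
set -euo pipefail

SRC="/srv/tmp/export-$(date +%Y%m%d)"   # temporary dump area
DST="/srv/backups"                      # final area, later picked up by bacula
mkdir -p "$DST"

# Package one tarball per database (useful on s3, which hosts many wikis); the
# temporary files can be removed once the tarballs are in place.
for createfile in "$SRC"/*-schema-create.sql.gz; do
    db="$(basename "$createfile" -schema-create.sql.gz)"
    ( cd "$SRC" && tar -czf "$DST/${db}-$(date +%Y%m%d).tar.gz" \
        "${db}-schema-create.sql.gz" "${db}."* )
done
```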
[12:34:49] 10DBA, 10Wikidata: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#3238485 (10Lydia_Pintscher)
[12:34:51] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3238480 (10Lydia_Pintscher) 05Open>03Resolved a:03Lydia_Pintscher Further work in T162539.
[13:15:02] 10DBA: Drop pre_ tables - https://phabricator.wikimedia.org/T118859#1811055 (10Marostegui) For the record, they exist on: s2 s3 s4 s5 s6 s7
[13:16:52] 10DBA: Drop pre_ tables - https://phabricator.wikimedia.org/T118859#1811055 (10jcrespo) We should backup the backup, I normally archive them for some weeks on es2001.
[14:05:39] BTW, I assume you saw the new checkboxes on tendril :-)
[14:06:27] I did :)
[14:06:29] hehe
[14:06:36] Very useful XD
[14:06:55] But I believe I saw you did that during the weekend or the bank holiday and I was like: grrrr!!!
[14:06:58] but thanks!
[14:13:07] 10DBA, 05MW-1.30-release-notes, 10MediaWiki-API, 10MediaWiki-Change-tagging, and 2 others: Slow query in API list=tags - https://phabricator.wikimedia.org/T164552#3238830 (10Anomie) 05Open>03Resolved The slow query isn't in ApiQueryTags anymore, now it's in ChangeTags and hidden behind a 5-minute cache.
[15:12:26] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3239057 (10Marostegui) db2052 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# m...
[15:13:07] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3239058 (10Marostegui) db2052 is done: ``` root@neodymium:/home/marostegui/git/software/dbtools# mysql --skip-ssl...
[15:48:05] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3239219 (10Marostegui) @Cmjohnson just checking if in the end you updated the idrac firmware? No pushing by any means, just checking if I need to powercycle this host next week or not. Thank...
[17:10:11] 10DBA, 10Monitoring, 07Puppet: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239514 (10jcrespo)
[17:11:19] 10DBA, 10Monitoring, 07Puppet: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239534 (10jcrespo)
[17:13:53] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3239556 (10Cmjohnson) I updated the firmware on db1070 and ipmitool is still not working, I compared the idrac settings via the gui with db1068 (ipmi works) and not differences between the t...
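For the ipmitool issue in the last comment above, a quick side-by-side check against the known-good host could look like the sketch below; the management hostnames and interface flags are assumptions based on common naming conventions, not commands taken from the task.

```
# Hypothetical comparison between the failing host (db1070) and the working one
# (db1068). The .mgmt.eqiad.wmnet names and the lanplus interface are
# assumptions; -a prompts for the iDRAC password instead of putting it on the
# command line.
ipmitool -I lanplus -H db1070.mgmt.eqiad.wmnet -U root -a chassis power status
ipmitool -I lanplus -H db1068.mgmt.eqiad.wmnet -U root -a chassis power status
```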
[17:16:35] 10DBA, 10Monitoring, 07Puppet: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239573 (10jcrespo)
[17:45:02] 10DBA, 10Monitoring, 07Documentation, 07Puppet: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239694 (10Reedy)
[19:34:16] 10DBA, 06Operations: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3239953 (10jcrespo) It took a bit more than 11 hours to reload logically db1015 (minus cebwiki) - that is 1.3 TB (out of a total of 1.5TB for all of s3). It is now back replica...