[03:36:15] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3343472 (10Legoktm)
[04:22:43] 10DBA, 10Operations, 10Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3343495 (10aaron) @daniel, can you look into the amount of purges happening in ChangeNotification jobs? I don't see an...
[05:16:30] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3343461 (10Marostegui) Hi Yes, if you could use the script, that would be preferred so we do not have to touch the tables manually and you have all the throttling mechanism already implemented :-) Also please...
[05:16:47] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3343513 (10Marostegui) p:05Triage>03Normal
[05:20:04] marostegui: is it OK if I start it now?
[05:31:14] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3343516 (10Legoktm) @Marostegui Is it alright if I start it now?
[06:36:58] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3343569 (10Marostegui) @Legoktm go ahead
[06:39:57] 10DBA, 10Labs: Prepare and check storage layer for atjwiki - https://phabricator.wikimedia.org/T167715#3341927 (10Marostegui) If this is going to be a public Wiki, there is nothing that the DBAs have to do. The only pending thing would be the table views created by #cloud-services-team once the tables are in p...
[06:40:13] 10DBA, 10Labs: Prepare and check storage layer for atjwiki - https://phabricator.wikimedia.org/T167715#3341927 (10Marostegui) p:05Triage>03Normal
[06:58:50] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1089: update RAID controller firmware - https://phabricator.wikimedia.org/T166935#3343584 (10Marostegui) 05Open>03Resolved a:05Cmjohnson>03Marostegui As Chris was having issues yesterday with the HP bundle, we decided that I would try to upgra...
[07:53:50] 10DBA: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342#3343643 (10Marostegui)
[08:27:00] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3343712 (10Marostegui) a:03Papaul
[08:28:12] 10DBA, 10Operations, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343718 (10Marostegui) 05stalled>03Open p:05Triage>03Normal Let's close this for now
[08:29:26] 10DBA, 10Operations, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3343722 (10Marostegui) 05Open>03stalled
[08:34:51] 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205#3343733 (10Marostegui)
[08:49:29] Can you review if this would make sense to you? https://gerrit.wikimedia.org/r/#/c/358529/
[10:16:51] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of Idh0854 -> Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3344164 (10MarcoAurelio) @greg Can we get a one hour window on wikitech:Deployments in coordination with @jcrespo and @Marostegui to do this? Thanks!
[10:21:16] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3344183 (10MarcoAurelio)
[11:22:12] 10DBA, 10Patch-For-Review: Migrate parsercache hosts to file per table - https://phabricator.wikimedia.org/T167567#3344306 (10Marostegui)
[11:51:04] jynus: https://gerrit.wikimedia.org/r/#/c/358529/ -> does that make sense to you? (no rush to review it - I am going for lunch now)
[12:05:18] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3344398 (10Marostegui) db1064 is done ``` root@neodymium:/home/marostegui# for i in `cat s4_tables`; do echo $i; mysql --skip-ssl -hdb1064 commo...
[12:05:35] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3344399 (10Marostegui)
[12:22:18] 10DBA, 10Labs: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3344427 (10Marostegui) If this is a public wiki (T155038#2934291), the DBAs do not have to do anything. #cloud-services-team will need to create the views once the tables are in place
[12:22:27] 10DBA, 10Labs: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3344431 (10Marostegui) p:05Triage>03Normal
[12:23:49] marostegui,jynus: hi! If you have time before next week it would be awesome to get a review of https://gerrit.wikimedia.org/r/#/c/356383/
[12:23:57] 10DBA, 10Labs: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#2931704 (10jcrespo) We do have to do the sanitization of both sanitarium (currently, the 2 of them) and labsdbs.
[12:24:17] I am interested in your suggestions about the queries and improvements in general
[12:25:28] I can give it a deeper look, but not today
[12:25:39] however, I see some bad things
[12:25:59] it is using the root user and password to connect
[12:26:21] it should be using a separate user, even if it has high privileges
[12:26:40] but most likely, it will only need write permissions on the log database
[12:27:05] sure sure even next week is fine, I was just bringing it up
[12:27:06] I suggested to parse my.cnf for the socket, but not for the user and password
[12:27:43] in fact, I do not think that will work without a socket
[12:27:48] about the user, I can remove the default and explicitly ask for a config file, the plan was not to use root
[12:27:50] because it most likely searches for it
[12:28:03] in the wrong place
[12:28:28] what do you mean? It should be able to connect with a proper user/pass, no?
[12:28:32] no
[12:28:43] I was under the assumption that the socket was not to be used
[12:28:47] most connectors will connect to localhost using socket authentication
[12:29:07] unless I am missing a protocol=TCP somewhere
[12:29:18] I suggested finding it like you said
[12:29:31] because otherwise the connection will fail
[12:30:07] you can try with a simpler connect script
[12:30:37] also, most likely, you will not find a root password
[12:30:42] okok makes sense, I'll do it.. I tried only in beta but the config is completely different (EL db on the host, probably manually configured)
[12:30:43] because we do not use one
[12:30:56] yeah, beta probably uses the default socket connection
[12:31:18] should I open a task to create a new user for the log database to establish the best set of perms?
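The socket-based setup being discussed above could be sketched roughly as follows. This is a minimal illustration, not the script under review: the my.cnf path, the `log` database name, and the use of the PyMySQL connector are all assumptions. Only the socket path is read from the config; no user or password is parsed, since authentication is meant to come from the unix user running the script.

```python
import configparser


def read_socket_path(my_cnf_path="/etc/mysql/my.cnf", section="client"):
    """Return the unix socket path from a my.cnf-style file.

    Path and section name are illustrative defaults. Real my.cnf files
    may contain directives like !includedir that configparser cannot
    parse, so a stripped-down client config is assumed here.
    """
    cfg = configparser.ConfigParser(allow_no_value=True, strict=False)
    cfg.read(my_cnf_path)
    # Deliberately no cfg.get(section, "user") / "password": the account
    # is resolved from the unix user via socket authentication.
    return cfg.get(section, "socket")


# Hedged usage (assumes PyMySQL is installed and a passwordless MySQL
# account matching the current unix user exists):
#
#   import pymysql
#   conn = pymysql.connect(unix_socket=read_socket_path(), db="log")
```

The design point from the conversation: with unix permissions controlling who can run the script, no password needs to exist at all.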
[12:31:18] so config.get('client', 'password') will raise an exception
[12:31:35] yes, or just send a gerrit commit,
[12:31:40] whatever you prefer
[12:31:46] super
[12:32:07] we should use the user with the minimum privileges possible
[12:32:25] and the advantage of localhost
[12:32:40] is that we can run that with a specific unix user
[12:32:46] completely agree
[12:32:47] so no passwords are needed
[12:33:00] it takes very little on the script
[12:33:12] just adjust the code to look for the socket
[12:33:20] there
[12:33:37] it was like that in the beginning but I might have misunderstood you the last time and changed it to user/pass
[12:33:51] no, I meant not to hardcode the socket
[12:33:54] the rest was ok
[12:34:08] okok
[12:34:13] you had something like '/tmp/mysql.sock'
[12:34:38] you do not even need a config
[12:34:56] just connect from os.effective_user()
[12:36:37] okok
[12:37:08] so the socket path is the only thing that needs to be configurable
[12:37:14] the rest will be taken care of by unix perms
[12:38:11] we can make it user + pass
[12:38:18] but if we do it
[12:38:20] not like that
[12:38:44] just user with no pass under the right unix permissions
[12:39:03] this is a conversation, so I am not necessarily right
[12:41:01] it makes sense to me
[12:41:19] but this would be a couple of line changes: db_socket = config.get('client', 'socket')
[12:41:32] then just use the socket + UID
[12:41:46] which may need extra changes on the puppet side to add the user
[12:41:54] but I think those are easy
[12:43:14] I am checking the offset parameter, I have never seen it on mysql before
[12:44:44] jynus: it is the same as limit 10,1000 but different syntax, I was surprised too, first time I saw it :D
[12:45:10] ok, I wonder if it is native, or it is compatible with other dbs
[12:45:43] I think LIMIT is not SQL standard, so I would assume it is not part of that
[12:46:21] I guess so, I think it is mysql/mariadb specific
[12:46:39] well, some dbs started using it
[12:46:46] because it became popular
[12:46:57] as a standard extension
[12:47:18] 10DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#3344506 (10Marostegui)
[12:47:26] "Both MySQL and PostgreSQL support a really cool feature called OFFSET that is usually used with a LIMIT clause." from 2005
[12:47:27] yeah, seems that sqlite has it too, with both syntaxes
[12:48:13] that query is the one I'm a bit worried about, see my comment in the CR, because of the known issue with big offsets
[12:48:29] FETCH FIRST (since SQL:2008) is the standard
[12:49:27] http://troels.arvin.dk/db/rdbms/#select-limit-offset
[12:50:41] so I agree with volans' comments
[12:51:33] what alternative would you guys recommend?
[12:52:13] so it depends
[12:52:34] the easiest way is to use a library to avoid all the overhead of that: pt-archiver
[12:52:46] if you want to implement it yourself
[12:53:05] read the first and the last item to purge
[12:53:13] (those should be immediate)
[12:53:32] while current < last:
[12:54:26] DELETE ... WHERE ts > X ORDER BY ts LIMIT batch
[12:54:51] you can select if you want to get control
[12:55:04] the delete part is more or less done in this way, it is the update that was a bit tricky
[12:55:35] elukey: do you have a "modified" column that you touch when updating those rows?
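The batched-DELETE loop sketched above ("while current < last: DELETE ... ORDER BY ts LIMIT batch") could look like this. A sketch only: the table and column names are illustrative, and `execute` stands in for whatever DB cursor the real script uses. The point is that each pass restarts from the edge of the `ts` index, so no OFFSET is ever needed and each batch is an index range scan.

```python
def purge_batches(execute, table, cutoff_ts, batch_size=1000):
    """Delete rows older than cutoff_ts in index-friendly batches.

    `execute` is any callable taking (sql, params) and returning the
    number of affected rows (e.g. a wrapper around cursor.execute that
    returns cursor.rowcount). Loops until a batch deletes fewer rows
    than batch_size, i.e. the table edge has been reached.
    """
    # NOTE: identifiers cannot be bound as query parameters, so `table`
    # must come from trusted code, never from user input.
    sql = (
        f"DELETE FROM {table} WHERE ts < %s "
        f"ORDER BY ts LIMIT {int(batch_size)}"
    )
    total = 0
    while True:
        affected = execute(sql, (cutoff_ts,))
        total += affected
        if affected < batch_size:
            return total
```

In a real run one would also sleep between batches for throttling (as pt-archiver does), which is omitted here for brevity.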
[12:55:36] don't select UUIDs and then update them
[12:55:51] volans: nope
[12:55:52] that is going to be horrible for performance
[12:56:11] select the timestamps
[12:56:19] and do update ranges based on that
[12:57:01] you can explain to get an approximation of the number of rows
[12:57:02] splitting the time could be a good approach
[12:57:21] that is what pt-table-checksum and pt-archiver do
[12:57:30] or I have some example for compare.py
[12:57:32] so taking the entire window of time, splitting it in multiple sub-time-windows and updating all of them
[12:57:37] no
[12:57:49] do not split, just put a limit from a starting point
[12:58:00] and then check the last updated one
[12:58:07] could be a select first
[12:58:15] basically pivot based on an index
[12:58:32] normally the PK or the ts
[12:58:35] how can I recognize the last updated one if I don't have anything that tells it to me?
[12:59:15] (I am asking to understand the solution, not to complain about it)
[13:00:22] you can select it if you want
[13:01:04] the thing is to avoid the offset
[13:01:12] while ordering by an index
[13:02:11] for example, you could count(*), max(ts) so it will give you how many rows are going to be deleted
[13:02:35] and what is the last one deleted or updated
[13:02:47] probably you may need an id
[13:02:54] jynus: for the deletion it's ok AFAIK, the UPDATE is more problematic because he doesn't have any field to tell him if a row was updated or not, unless changing the schema
[13:02:55] if there are many similar ts
[13:03:07] that is why I say to keep track of it
[13:03:21] if it is not by ts, by id
[13:04:37] so that would be scanning by id, knowing the max and min id (supposing they are more or less correlated)
[13:05:04] but with the options he wants (I assume ts < X)
[13:05:35] it is difficult to explain, easier to code
[13:05:43] I don't remember if the table has an ID or uses the UUID as an ID, tbh
[13:05:48] it has both
[13:05:51] I asked for ids
[13:05:55] because uuids are useless
[13:05:56] :)
[13:06:09] and they are monotonically increasing
[13:06:14] which simplifies the problem
[13:06:22] unlike uuids, which are random
[13:06:26] some tables do not have ids IIRC
[13:06:29] but only uuids
[13:06:41] or ts, which could be backfilled or not increasing
[13:06:55] in that case, you are screwed
[13:07:06] :D
[13:07:11] and you will have to do a full table scan every time you do a purge
[13:07:17] (mostly)
[13:07:42] uuids are indexed IIRC, so it shouldn't be that bad, no?
[13:07:51] (ignorant /me asking)
[13:08:00] how do you order by uuid?
[13:08:15] and I am not asking to say ORDER BY uuid
[13:08:34] I am asking how uuids are helpful to scan the latest 30 rows inserted?
[13:08:47] they are not
[13:08:52] do you see the problem?
[13:09:00] and why I have a crusade against uuids?
[13:09:03] :-)
[13:09:04] elukey: do you have a field that is *always* not-null on real rows and *always* NULL on purged rows?
[13:09:12] that could work
[13:09:16] if we had an index on ts and this field
[13:09:31] the always NOT-NULL yes, the rest I don't know
[13:09:46] it depends on what fields people need to save
[13:09:49] and what not
[13:10:16] right, you have only fields that you *always* keep as is, not that you *always* reset to NULL IIRC
[13:10:29] yep
[13:10:38] if you have one that is a good candidate... consider this option ;)
[13:10:42] note I said I agreed with volans
[13:10:55] I am not saying it can be done better
[13:11:01] we have to work with what we have
[13:11:15] lol
[13:11:19] the problem with full scans is that not only will it be slow and painful
[13:11:37] it will likely affect the "replication"
[13:11:49] I think the newest tables have PKs
[13:12:01] so maybe we could use PKs for the new ones
[13:12:15] here PK == autoincrement
[13:12:31] and fall back to full table scans for old tables
[13:12:43] even have some kind of temporary metadata
[13:12:58] "I have purged until timestamp X, do not scan again"
[13:13:25] these are all ideas, not sure yet
[13:14:37] for legacy tables without a tracking field another option could be to split the time in smaller chunks and use limit+offset, assuming that the offset will be small given the smaller timeframe
[13:15:02] not elegant at all, I know :)
[13:15:08] yes, that depends on the expected growth, etc
[13:15:40] I would definitely run some dry-run showing the queries to be done
[13:15:56] with explain results, and we say "this is good enough"
[13:16:00] or "this will not work"
[13:16:07] sure, it might do a count for a larger period and assume a more or less even distribution over time (which I'm sure it is not, of course :D )
[13:16:57] ok so I can come up with some numbers for the limit-offset use case
[13:17:02] for the busiest tables
[13:17:09] ok
[13:17:09] and paste the results in a task
[13:17:11] super
[13:17:14] that would help
[13:17:22] if we are talking 200 rows
[13:17:27] it is not worth optimizing
[13:17:35] if it is 20 million, it is :-)
[13:19:30] I will give it a look later in the week, elukey
[13:19:54] but all that info is interesting
[13:20:02] it is not as much a review as a collective debugging
[13:20:50] so Marcel told me that an estimate of ~2M rows/day for the busiest EL table should be expected
[13:20:55] but I'd need to verify it
[13:22:13] 2M to delete, to update, etc... how often will this run?
[13:23:50] let's say to update, since it will probably be the table with the most valuable data
[13:24:10] and not sure how frequent; I thought in the beginning once a day, but maybe more could be a good compromise
[13:27:53] it takes some time to do that, that is why I asked for more time. if you want to do some analysis in advance, that is something we will gain; if not, I will do it and it will take me some time
[13:28:15] I will ask you for patience
[13:28:27] because the possibility of data loss is large
[13:28:58] and I think last time it was tried, it didn't work very well, so we should be very careful
[13:36:05] 10DBA, 10Labs: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#2931704 (10chasemp) #dba gents please let us know when it's ready for views and meta_p fixup and we'll get on it! Thanks
[13:37:05] 10DBA, 10Labs: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3344674 (10Marostegui) Will do @chasemp! The DB isn't yet created. Thanks!
[13:51:43] jynus: definitely, thanks for the time!
[14:15:48] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344848 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque...
[14:16:05] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344851 (10Marostegui) Thank you!!
[14:38:36] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3344895 (10Papaul) After creating the case the HP guy is telling me that he cannot place the order because there is a hold on the server, and he cannot tell me what the hold is; another team will contac...
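The "pivot based on an index" approach discussed above (walking the table by autoincrement PK instead of using LIMIT+OFFSET) could be sketched like this. All names are illustrative, and it assumes what the conversation assumes: a monotonically increasing `id` roughly correlated with `ts`. Each window drives one bounded UPDATE that is an index range scan on the PK; gaps in the id sequence only make some batches smaller, they do not break the walk.

```python
def pk_ranges(min_id, max_id, batch_size):
    """Yield (low_exclusive, high_inclusive) id windows covering [min_id, max_id].

    The caller first fetches min_id and max_id (cheap index edge lookups,
    e.g. SELECT MIN(id), MAX(id)), then issues one UPDATE per window.
    """
    low = min_id - 1
    while low < max_id:
        high = min(low + batch_size, max_id)
        yield (low, high)
        low = high


# Each window then drives one bounded purge UPDATE, e.g. (table, columns
# and the exact sanitization SET clause are hypothetical):
#
#   UPDATE events SET payload = NULL
#   WHERE id > %s AND id <= %s AND ts < %s
#
# Tracking the last processed `high` also gives the "I have purged until
# X, do not scan again" metadata mentioned above, for free.
```

This is exactly the shape that avoids the big-OFFSET problem: the predicate on `id` replaces the OFFSET, so MySQL never has to skip over already-processed rows.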
[15:03:53] 10DBA, 10Operations, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3344966 (10Marostegui) I just checked db1098 for instance and it still has the same issue indeed, so I assume the other ones will remain with the same issue.
[18:46:47] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3345698 (10greg) >>! In T167031#3344164, @MarcoAurelio wrote: > @greg Can we get a one hour window on wikitech:Deployments in coordination with @jcrespo and @Marost...
[19:25:36] 10DBA, 10Labs, 10Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3345782 (10bd808) p:05Triage>03Normal
[19:26:19] 10DBA, 10Labs, 10Labs-Infrastructure, 10Tracking: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3345788 (10bd808) p:05Triage>03Normal
[22:02:27] 10DBA: Need to empty "linter" database table on large wikis - https://phabricator.wikimedia.org/T167758#3346324 (10Legoktm) 05Open>03Resolved a:03Legoktm All done!
[23:19:01] 10DBA, 10Patch-For-Review: Migrate parsercache hosts to file per table - https://phabricator.wikimedia.org/T167567#3346612 (10tstarling)