[09:08:43] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2886031 (10Marostegui) A loop writing around 1T to a ramdisk didn't make the server crash, just multiple OOM but that is expected. I am going to play with the smp affinity a bit and re do the trans...
[10:00:41] 10DBA, 07Epic, 07Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#2886109 (10Marostegui) >>! In T54921#2884583, @Legoktm wrote: > > Yes. {T119154} has some of the historical context about why those tables are there, and...
[10:08:47] 10DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2886129 (10Marostegui)
[10:22:22] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886208 (10Marostegui)
[10:22:58] 10DBA, 06Collaboration-Team-Triage, 06Operations: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#2886223 (10Marostegui)
[10:23:01] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886222 (10Marostegui)
[10:23:04] 10DBA, 07Epic, 07Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#2428438 (10Marostegui)
[10:25:18] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886226 (10Marostegui) a:03Marostegui I will start by making sure they are indeed not written. An easy test without messing with production (and which is needed anyways to export x1 to dbstore2001 T151552) will b...
[10:30:07] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886208 (10Krenair) Note wikitech is not a database itself, in this context it is a database list containing labswiki and labtestwiki, hosted on silver and labtestweb2001 respectively
[10:30:41] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886240 (10Marostegui)
[10:30:55] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886208 (10Marostegui) >>! In T153638#2886238, @Krenair wrote: > Note wikitech is not a database itself, in this context it is a database list containing labswiki and labtestwiki, hosted on silver and labtestweb2001 respe...
[10:36:42] 10DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#2886247 (10Legoktm)
[14:16:00] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2886984 (10Marostegui) I have written a small quick guide to set up netconsole as I needed it for trying to capture messages for this host: https://wikitech.wikimedia.org/wiki/Netconsole
[14:18:37] 10DBA, 06Labs, 10Labs-Infrastructure: Create a cronjob/check to run check_private_data data script and report back - https://phabricator.wikimedia.org/T153680#2887002 (10Marostegui)
[14:34:14] 10DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2887037 (10Marostegui) We believe this host alerted for "/" and not for the /srv/ partition itself. There were no massive queries running at that time which could cause big temporary tables. But the "/ partition on the other hand l...
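For context on the 09:08:43 entry above, here is a minimal sketch of the kind of ramdisk write loop Marostegui describes for T149553; the mount point, tmpfs size, block size and pass count are assumptions, not the actual commands used:

```
# Hypothetical ramdisk stress test: write roughly 1T in total through a tmpfs mount.
# All paths and sizes below are illustrative assumptions.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=48G tmpfs /mnt/ramdisk   # tmpfs is RAM-backed, so filling it can trigger OOM
for pass in $(seq 1 25); do
    # each pass writes ~40G; ~25 passes approximate the ~1T mentioned in the ticket
    dd if=/dev/zero of=/mnt/ramdisk/testfile bs=1M count=40960
    rm -f /mnt/ramdisk/testfile
done
umount /mnt/ramdisk
```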
[14:38:31] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2887039 (10Marostegui) After playing with the affinity I could see the interrupts being distributed among the cores: ``` Before: root@db2034:~# cat /proc/irq/{91,92,93,94,95}/smp_affinity 00000000,3...
[14:49:53] 10DBA, 06Labs, 10Labs-Infrastructure: Create a cronjob/check to run check_private_data data script and report back - https://phabricator.wikimedia.org/T153680#2887047 (10jcrespo)
[15:24:48] jynus: marostegui I have to take part of today, and all of tomorrow off due to some personal things that came up over the weekend. Can we plan on a sync up early my time wednesday? Getting down to the wire but I think we are mostly set, just want to coordinate last measure things
[15:26:01] chasemp: sure, sounds good. wed depending on the time might be hard for me as I need to go to the hospital, but will do my best to be available
[15:26:18] I'll drop a cal apt so we can coordinate easier
[15:26:30] :-)
[15:27:05] Anyway, jynus knows exactly all the details so do not get blocked on me being late, absent or leaving early anyways. I will try to adapt to you guys
[15:27:11] k
[15:27:27] chasemp, just 1 question
[15:27:31] shoot
[15:27:41] we can do a mini-sync up now too :)
[15:27:43] would maintain-views --all-databases
[15:27:53] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2887096 (10Marostegui) Interesting, we have a new error logged: ``` description=Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000004, Status 0xB20000...
[15:27:57] chasemp: that too :)
[15:28:05] would do proper things, even if not all dbs are there?
[15:28:32] yes, as in, views for what does exist is handled and no errors reported
[15:29:11] it's not ensuring all defined dbs in all dbs are made accessible via views, it's just ensuring all present dbs defined in all dbs are
[15:29:16] should I be worried about "WARNING Too big for this database" ?
[15:30:12] i believe no, but I'll look at it, it's a protection I think put in from when labsdbs were very modest compared to prod
[15:30:26] it's not an error as much as it is a notice some things are considered out of scope
[15:30:59] ok, so I will run that and sync with yuvipanda later to test everything
[15:31:34] and I'm thinking post-wed-meeting we will all be ok w/ me opening up the FW from instances to the dbproxies
[15:31:44] and we'll have some measure in place to prevent use or will be ok w/ use (either way)
[15:32:13] I would delay any official announcement till the developers conf
[15:32:18] agreed totally
[15:32:22] ^ agreed
[15:32:27] silent launch even if we open everything
[15:32:28] and ping some users
[15:32:37] for testing
[15:32:49] (we cannot support things properly during christmas)
[15:33:56] totally agreed
[15:34:08] we'll tee it all up and be ready for you to mention it at the dev summit :)
[15:35:04] jynus: besides putting the finishing touches on maintain-dbusers and opening up access to the dbproxies do you know of anything from us you are waiting on?
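Relating to the smp_affinity excerpt quoted at 14:38:31, a rough sketch of how interrupts can be spread across cores by writing per-IRQ CPU masks; the IRQ numbers are taken from that excerpt, while the masks themselves are illustrative assumptions, not the values used on db2034:

```
# Show the current CPU mask for each NIC IRQ (IRQ numbers from the excerpt above)
for irq in 91 92 93 94 95; do
    cat /proc/irq/$irq/smp_affinity
done
# Pin each IRQ to a different core; smp_affinity takes a hexadecimal CPU bitmask
echo 1  > /proc/irq/91/smp_affinity   # CPU0
echo 2  > /proc/irq/92/smp_affinity   # CPU1
echo 4  > /proc/irq/93/smp_affinity   # CPU2
echo 8  > /proc/irq/94/smp_affinity   # CPU3
echo 10 > /proc/irq/95/smp_affinity   # CPU4 (0x10)
```

Note that irqbalance, if it is running, may rewrite these masks on its own.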
[15:35:07] now I mean
[15:35:23] validation is most of it
[15:35:27] kk
[15:35:27] I do not have a tool
[15:35:41] to test both access and grants
[15:36:22] gotcha, we'll try to talk to a few prominent users then
[15:36:30] marostegui, db1015 has finally gone red
[15:36:38] :(
[15:36:41] let me see
[15:37:16] Chasemp jynus I was using quarry-dev as a test tool
[15:37:40] (just finished food heading back)
[15:37:48] Are all the Dbs there now?
[15:38:01] no
[15:38:18] only s1 (enwiki) and s3 (~800 smaller wikis)
[15:39:36] Ah ok
[15:39:41] So I can't switch over quarry yet
[15:39:59] jynus: I gave it 30G extra
[15:40:19] jynus: I am going to defrag some tables to get some extra G so it can leave us alone for a bit too
[15:41:00] I wanted to drop it and do a logical load, even if it takes a month
[15:41:19] being many wikis, it can be loaded in parallel very efficiently
[15:42:10] yuvipanda: so our timeline is basically, wrap up all things from our side by wed meeting (it's slightly out of your defined working hours but I gut checked that it would be ok)
[15:42:29] then we'll soft launch w/ the data present and do some poking
[15:42:35] jynus: the only issue is that it is the only rc server we have and if I do the alters it will get delayed :(
[15:43:19] in this server
[15:43:37] (shard) partitions are not a thing because the wikis are so small
[15:43:59] the idea was actually to leave it with only 3 servers
[15:44:11] maybe 4 with dump/vslow
[15:44:28] ah, I see
[15:45:10] so we have two options: we depool it (remember code freeze), or I silence it and do some alters around to get some "G" back even though it will get some delays
[15:46:24] I can alter not the biggest tables, but some of the ones with 500M or so, which shouldn't take much time and given the amount of wikis it has…if we free up around 200M per wiki, that is quite some G back
[15:46:36] the problem is that I think many of those servers
[15:46:42] do not use file-per-table
[15:46:50] hence the reimporting
[15:47:30] db1015 does thankfully
[15:48:03] ok, then
[15:51:59] I will silence it and alter pagelinks then?
[15:56:08] chasemp: did maintain-views get run? Am looking at labsdb1009 as a tool user now and don't see any _p databases
[15:56:37] (there's a replica.my.cnf in /home/yuvipanda in labstore1004 for testing as a tool btw)
[15:57:14] yuvipanda: not by me iirc, jynus was (I believe) looking at doing it
[15:57:25] ah, ok
[15:57:27] yuvipanda: you can also run it for one db only and then drop that _p if you need to test
[16:01:49] hmm
[16:01:50] pymysql.err.OperationalError: (1044, "Access denied for user 'maintainviews'@'localhost' to database 'enwiki_p'")
[16:02:01] I guess we need to add a grant for that user too?
[16:02:23] ah yes seems likely then
[16:05:27] ?
[16:15:07] your script must have some issue
[16:15:20] because it requires flush privileges to work
[16:17:09] jynus: chasemp hmm, same script as the one that runs in the old labsdbs
[16:17:09] what is 'flush privileges'?
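On the "Access denied" error at 16:01:50 and the "flush privileges" question at 16:17:09: a hedged sketch of the kind of database-level grant the maintainviews account appears to be missing; the exact privilege list used in production is an assumption:

```
# Grant the view-maintenance account rights on the _p database it failed on.
# The privilege list here is an assumption, not the production grant.
mysql -e "GRANT SELECT, SHOW VIEW, CREATE VIEW, DROP ON enwiki_p.* TO 'maintainviews'@'localhost'"
# FLUSH PRIVILEGES reloads the grant tables, so grants added by editing the
# mysql.* tables directly take effect without restarting the server.
mysql -e "FLUSH PRIVILEGES"
```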
[16:17:30] yuvipanda, we found the same issue on the old servers, too
[16:17:46] until we do some account creation or drop
[16:17:56] things do not settle down
[16:18:00] now it works
[16:18:38] but please check you cannot access the databases from outside labstore1004
[16:19:13] jynus: yeah, I can't access it from tools :)
[16:19:22] check now
[16:19:32] the cnames
[16:19:42] which I cannot remember (I am looking now)
[16:20:45] mysql --defaults-file=replica.my.cnf -h labsdb-analytics.eqiad.wmnet enwiki_p -e "SELECT * FROM user LIMIT 10"
[16:21:41] that's blocked until we open it up post wed meeting
[16:21:51] at least that's my intention
[16:21:54] ok ok
[16:22:05] then try it from labsdb1004
[16:26:06] jynus: WORKS!
[16:26:08] (from labstore1004)
[16:26:13] jynus: 'show databases' doesn't show it tho
[16:26:35] not sure if that should work
[16:26:51] that is what meta_p is about
[16:27:04] sorry
[16:27:10] information_schema_p
[16:27:35] (although they have the same info, in some cases)
[16:28:41] yuvipanda, the plan is now to break it
[16:28:47] right
[16:28:50] we gather all bugs
[16:28:54] it -> ?
[16:28:54] report everything
[16:28:58] the database
[16:29:05] right
[16:29:07] (not too hard, though)
[16:29:39] do you have suggestions?
[16:29:52] try to find things that shouldn't be there
[16:30:02] e.g. the first thing I tried was to select from enwiki
[16:30:09] right
[16:30:10] there's the labsdb auditor
[16:30:23] yes, I wrote one
[16:30:33] we already ran it
[16:30:39] but I am running it again
[16:30:48] nice
[16:31:13] when we are confident, we end up setting up the other databases
[16:31:44] here databases means labsdb1010/1 and other schemas other than enwiki
[16:31:55] * yuvipanda nods
[16:32:22] there is also some "typical queries" on wikitech
[16:32:40] we can run them on current and new labsdb
[16:32:54] see if there is any error (I would expect errors being corrected on the new ones)
[16:32:56] right. I can also run some of quarry's things on it manually and see if the results match
[16:33:01] yes
[16:33:14] so, I am a bit vague here
[16:33:23] because it is general QA
[16:33:46] right
[16:34:06] also, if we get some performance difference
[16:34:25] we can "sell the product" with the new differences
[16:34:53] I was thinking of setting up a quarry-dev tool and having that use the new dbs, so it'll be ok with just the wikis we have
[16:34:57] but requires that we open it up to labs I guess
[16:34:58] and we don't want to do that yet
[16:35:31] yes, we can do that as a next step
[16:36:39] * yuvipanda nods
[16:37:19] there is some tuning I will have to do
[16:37:25] regarding user limits
[16:38:15] * yuvipanda nods
[16:39:29] I can also try to bring down a server to check the proxy works as intended
[16:39:43] but only in coordination with you
[16:39:45] yes :D
[16:40:00] plus also tell you how to handle that
[16:40:07] in case I am not around
[16:40:31] * yuvipanda nods
[16:40:36] I suppose it isn't automatic
[16:40:39] it is
[16:40:58] but you know, recovery from failover is not automatic
[16:41:06] ah, right
[16:41:07] and in some cases we may want to handle it manually
[16:41:16] etc.
[16:41:38] basically, you having all the info
[16:41:48] even if ideally you will never use it
[16:42:28] * yuvipanda nods
[16:42:40] you == all labs team
[16:43:12] yup
[16:44:01] jynus: this week the labs team's availability is a little slim, and next week most of ops is off, so I guess it'll be first week of jan? or do you want to just test it this week? in which case I can be around most days except fri
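As a concrete example of the "typical queries" / general QA approach discussed above, a sketch of a spot check a tool user could run from labstore1004: compare one query between an old and a new replica, and confirm the unredacted schema stays inaccessible. The hostnames and the sample query here are assumptions for illustration, not a prescribed test plan:

```
# Run the same query against an old and a new replica and compare the output
# (small differences can simply be replication lag).
Q="SELECT COUNT(*) FROM page WHERE page_namespace = 0"
mysql --defaults-file=replica.my.cnf -h labsdb1003.eqiad.wmnet enwiki_p -e "$Q" > old.out
mysql --defaults-file=replica.my.cnf -h labsdb1009.eqiad.wmnet enwiki_p -e "$Q" > new.out
diff old.out new.out && echo "results match"

# "Things that shouldn't be there": querying the raw (non-_p) schema should fail
# with an access-denied error for a replica user.
mysql --defaults-file=replica.my.cnf -h labsdb1009.eqiad.wmnet -e "SELECT 1 FROM enwiki.user LIMIT 1"
```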
[16:44:55] I can do it, I just need not to step on anything you do
[16:45:13] probably if I do it during my morning, I will not?
[17:59:49] jynus: sure!
[17:59:59] jynus: we aren't doing anything without checking with you first :)
[18:00:23] jynus: my next steps are basically to ensure code quality on the new script is fine, and deploy it properly with puppet. I plan on doing this before wednesday
[18:00:31] and am on Indian time till thursday
[18:00:41] ok
[19:37:47] 10DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2887923 (10fgiunchedi) Correct, the page was for `/`. There were some old `atop.log` log files from 2015 I've removed and cleared the apt cache. Other than that the host's disk space usage seems to be creeping up over time anyway. IIRC it...
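For reference, a sketch of the kind of "/" cleanup fgiunchedi describes in the final comment above; the exact paths and the retention cutoff are assumptions:

```
# See what is eating the root filesystem
df -h /
du -sh /var/log/atop /var/cache/apt/archives
# Remove atop history files older than a year and clear the apt package cache
find /var/log/atop -name 'atop_*' -mtime +365 -delete
apt-get clean
```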