[05:55:55] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3302817 (Marostegui)
[05:58:06] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3302818 (Marostegui)
[05:59:36] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3302831 (Marostegui) db1081 is done: ``` root@neodymium:/home/marostegui# for i in `cat s4_tables`; do echo $i; mysql --skip-ssl -hdb1081 commo...
[05:59:52] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3302832 (Marostegui)
[07:01:03] DBA, Analytics-Kanban, Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3302895 (Marostegui) Open>Resolved This is now back to Optimal ``` root@db1046:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id:...
[07:34:11] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302946 (hashar) @demon connected on each of the beta database on May 23 for: ``` lang=irc 2017-05-23 16:55 dropped flow_ext...
[07:34:58] marostegui: good morning. If you ever get a minute or so to explain to me what sql bin files are for and how to garbage collect them. I could use some help :-}
[07:35:10] nowhere near urgent though :-]
[07:35:23] Hey hashar the sql log bin files you mean?
[07:35:44] that is for the beta cluster database, one instance has a bunch of files such as /srv/sqldata/deployment-db03-bin.000016
[07:35:48] each 1.1GBytes
[07:35:50] Right
[07:35:54] https://phabricator.wikimedia.org/T166060#3302946 ;D
[07:35:54] Those are the binlogs then
[07:36:06] that is to replay transactions isn't it?
[07:36:24] yeah, those contain all the statements that come from the master (if there is a master)
[07:36:40] checking the task
[07:37:10] at some point the instance disk got almost completely filled due to those bin files. There is magically less today :d
[07:37:12] is there any replication between those hosts?
[07:37:18] so I guess there is some garbage collection going on at some point
[07:37:27] yeah one is the master, the other is a replica
[07:38:01] Yes, if the replica is behind for some reason, it gets the relay logs piling up until it catches up and then starts to clean them out
[07:38:27] Also the master has a flag to know when to expire the log bin, so, binlogs older than XX days get deleted
[07:38:51] so supposedly that is self healing :-}
[07:39:01] How can I connect to those hosts?
[07:39:10] (meanwhile I noticed the mysql CLI spurts results with color!!!!)
[07:39:30] master is: ssh deployment-db03.deployment-prep.eqiad.wmflabs
[07:39:36] It is not self healing really
[07:41:37] hashar: on the slave can you run: show slave status\G and paste it somewhere?
[07:43:24] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302963 (hashar) deployment-db04: ``` root@BETA[(none)]> SHOW PROCESSLIST \G *************************** 1. row ***************************...
[07:43:29] marostegui: https://phabricator.wikimedia.org/T166060#3302963
[07:43:46] to connect to labs instance maybe I need to add you as a project member :D
[07:43:53] Right, so if you see this
[07:43:57] ops also need a specific bastion
[07:44:00] Seconds_Behind_Master: 0
[07:44:03] the slave is up to date
[07:44:10] and it is currently running: Master_Log_File: deployment-db03-bin.000053
[07:44:22] So you would be safe to delete the master binlogs up to that one
[07:44:40] I would be careful and: copy up to 40 to somewhere else (just in case)
[07:44:43] and then on the master run
[07:44:57] purge binary logs to 'deployment-db03-bin.000040';
[07:45:14] does it happen to purge them automatically after X time?
[07:45:41] run this on the master: show global variables like 'expire_logs_days';
[07:46:04] 30
[07:46:08] right
[07:46:15] so it will keep 30 days of binlog
[07:46:24] which I think you cannot afford :p
[07:46:37] you can change that with set global expire_logs_days = 7 (for 7 days)
[07:46:40] or whatever you think you need
[07:46:41] yeah I am preparing the puppet patch
[07:46:46] make sure to change that on the my.cnf too
[07:47:11] you can run the set global directly on the mysql prompt
[07:47:42] but yes, change my.cnf too or otherwise it will get lost once you restart mysql (if you do for any other reason)
[07:48:58] marostegui: you are awesome :-}
[07:49:13] haha no way! just a dba!
[07:49:32] will you update the ticket or you want me to? (just saw you added DBA)
[07:50:11] https://gerrit.wikimedia.org/r/356337 beta: keep less mysql bin logs :-}
[07:50:17] and I did the set on the master
[07:50:26] so I guess that is covered and from now on the bin log will be collected
[07:50:36] +1 ed
[07:51:12] I am not sure if they get removed if you lower it on the fly
[07:51:25] you might need to run the purge command i sent earlier
[07:51:34] definitely.
I will do it
[07:51:50] the slave is up to date, so it is safe
[07:51:52] however
[07:52:02] before doing it, always make sure the slave is up to date
[07:52:17] otherwise if you delete a binlog which the slave hasn't reached yet, it will get disconnected from the master
[07:52:28] surely one does not want to delete bin logs that are yet to be processed by the slave right?
[07:52:32] or the replication ends up broken?
[07:52:36] exactly
[07:52:45] that is why I said: copy them somewhere else first
[07:52:51] and then delete up to the 40 or something like that
[07:53:06] There is really no need, as we have seen it is up to date
[07:53:16] But I am always really careful, you never know!
[07:54:51] that is why I am not a DBA. I am not careful enough :]
[07:55:16] puppet patch applied and does what is intended. I guess you can merge the patch if you get time for the merge dance
[07:55:26] sure
[07:56:13] merged
[07:57:04] \O/
[07:57:15] \o\ |o| /o÷
[07:57:20] Time to run the purge on the master then?
[07:58:33] yes
[07:58:34] DBA, Beta-Cluster-Infrastructure, Patch-For-Review, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302974 (hashar) a:Marostegui Aced by Manuel in less than a minute. The root cause is the master was expiring the bin...
[07:58:46] DBA, Beta-Cluster-Infrastructure, Patch-For-Review, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302976 (hashar) Open>Resolved
[07:58:52] moving the files
[07:58:55] and running the purges
[07:58:59] don't move them
[07:59:01] copy them
[07:59:03] i mean
[07:59:05] cp instead of mv
[08:02:31] marostegui: works like a charm :-)
[08:02:36] \o/
[08:02:50] Remember the drawback of the change
[08:03:04] if for whatever reason the slave gets behind for more than 7 days it will get disconnected from the master
[08:03:15] if you see that coming, set global expire_logs_days = 14
[08:03:17] or whatever
[08:03:28] to gain some room
[08:03:50] I don't think we monitor the slave being lagged
[08:03:59] so most probably we would end up having to resync it entirely
[08:05:24] Ah right, up to you guys in that case :)
[08:06:06] DBA, Beta-Cluster-Infrastructure, Patch-For-Review, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302990 (hashar) For the record, following Manuel instructions: On the master (db03) I have copied the bin files 000 to 0...
[08:07:10] marostegui: thank you very much. That is one less problem :-}
[08:07:51] No worries! Happy to help
[08:07:51] DBA, Beta-Cluster-Infrastructure, Patch-For-Review, Release-Engineering-Team (Next): beta cluster databases have almost full disks - https://phabricator.wikimedia.org/T166060#3302991 (Marostegui) >>! In T166060#3302990, @hashar wrote: > For the record, following Manuel instructions: > > On the m...
[08:37:37] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3303045 (Marostegui) So, given the issue with the compressed tables and if we still want to use db1070 - this is the procedure I have thought to get it o...
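Editor's note on the binlog purge discussed between 07:43 and 08:03: the rule given in the channel (only purge when the slave shows Seconds_Behind_Master: 0, and purge up to a file somewhat older than the slave's current Master_Log_File) can be sketched as below. This is a minimal illustration, not a tool used in the conversation. The function names purge_target and purge_statement and the default margin of 13 files are hypothetical; the margin is chosen only so the example reproduces the 000053 to 000040 choice made above. The code builds the statement text and does not connect to MySQL.

```python
import re


def purge_target(replica_master_log_file, keep=13):
    """Return the binlog name to use in PURGE BINARY LOGS TO.

    replica_master_log_file is the Master_Log_File value reported by
    SHOW SLAVE STATUS on the replica. keep is a hypothetical safety
    margin: how many files before that one are left on the master.
    """
    match = re.fullmatch(r"(.+)\.(\d+)", replica_master_log_file)
    if match is None:
        raise ValueError("unexpected binlog name: %r" % replica_master_log_file)
    base, seq = match.group(1), match.group(2)
    # Never go below the first file; keep the zero padding of the original.
    target = max(int(seq) - keep, 1)
    return f"{base}.{target:0{len(seq)}d}"


def purge_statement(seconds_behind_master, replica_master_log_file, keep=13):
    """Build the purge statement, refusing if the replica is not caught up.

    seconds_behind_master is None when replication is stopped (NULL in
    SHOW SLAVE STATUS), which is also treated as unsafe.
    """
    if seconds_behind_master != 0:
        raise RuntimeError("replica is not up to date; do not purge")
    target = purge_target(replica_master_log_file, keep)
    # PURGE BINARY LOGS TO removes files older than the named one,
    # and keeps the named file itself.
    return f"PURGE BINARY LOGS TO '{target}';"


if __name__ == "__main__":
    print(purge_statement(0, "deployment-db03-bin.000053"))
```

As advised in the channel, the files about to be purged would be copied (cp, not mv) somewhere else before the statement is run on the master.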
[09:05:57] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300419 (Marostegui) I have checked all the wikis listed on dblists and the one I have found missing...
[09:24:22] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3303145 (Marostegui)
[09:55:43] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3303196 (jcrespo) Don't use mysqldump- mydumper will be faster, and works great. Just make sure you do not backup (or remove later) common dbs like mysql...
[09:57:36] jynus: I was playing with mydumper a bit yesterday, how do you bypass ssl connection error?
[09:57:52] I tried locally on db1070 just to play around and check its options
[09:59:23] yes
[09:59:31] I did something, but I cannot remember what
[09:59:34] haha
[10:00:08] I think I edited the my.cnf file and started the program- it is only read at start
[10:01:03] ah right
[10:01:05] let me try
[10:01:27] remember to compress and to tune the threads
[10:01:39] editing the file works indeed
[10:01:40] I think I had a paste with the command line somewhere
[10:01:49] yes
[10:01:51] i was trying this
[10:01:58] -c -h localhost -t 8 -u root -r 100000000 -B dewiki
[10:03:23] that looks about right
[10:03:37] the creation should be rather fast
[10:03:56] oh, you may miss wikidatawiki?
[10:04:07] yeah, that was only my initial test :)
[10:04:17] ah, ok
[10:05:42] https://bugs.launchpad.net/mydumper/+bug/912432
[10:06:44] it may not be in jessie's mydumper
[10:07:17] then it will be just --defaults-file=/dev/null
[10:07:54] I will ack db1047 s1 lag
[10:12:17] actually, it probably reads already the mydumper section, so we could tune it with that
[10:12:51] thanks
[10:13:12] I started working with it, but then I stopped at some point
[10:13:26] but functionality wise, it worked nicely
[10:13:38] I have done some tests now and god…it is fast indeed
[10:14:30] you can thank the author later in the day
[10:14:40] jynus: hello! Is there an up to date version somewhere of your python WMF lib to connect to dbs ? I'd like to keep my code as close as possible to re-use yours when ready
[10:14:53] date version?
[10:15:13] ah, up to date
[10:15:38] latest version is on https://gerrit.wikimedia.org/r/#/c/354206/
[10:15:43] not yet released
[10:21:10] super thanks
[10:21:59] yeah, that still apparently has /tmp hardcoded, I need to change that to reading my.cnf, as I do for labs
[10:22:16] and that is why it is not released yet
[10:56:40] marostegui: I was about to reimage db2044, but I see it has been downtimed for a long time, any problem with it?
[10:56:50] not that I recall
[10:57:03] could be just because of alter table or something?
[10:57:18] it is from the 26?
[10:57:29] Ah, maybe that is from the day that we lost icinga alerts
[10:57:32] and i downtimed it for a long time
[10:57:40] as i didn't know when you'd take care of it
[10:57:40] ok
[10:57:48] but from my side, I am not doing anything with it
[10:58:18] ok, I am reimaging it, so backup happens while I am away
[10:58:24] cool
[12:29:43] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3303684 (Marostegui)
[12:54:50] db2035 and db2041?
[12:55:16] it seems it is only a soft state for now
[12:55:23] on the new checks
[12:57:10] Yeah, it was a new check that mutante set in the morning
[12:57:27] We already discussed that it was too "intense"
[12:57:42] I think volans or alex was going to change the interval for it
[12:58:28] I just talked, ema did the work, it's already every 30m with retry every 10m and timeout at 30s
[12:58:35] ah ema :)
[12:59:39] and this is exactly why I didn't try to fix the writeback check for all servers, but decided to make it opt-in
[12:59:46] it is very difficult to get right
[12:59:53] and we have already many ipmi bugs
[13:10:18] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3303799 (Marostegui) @Cmjohnson db1048 is now down and ready for you to swap the BBU Thanks!
[13:18:24] I have put this together https://wikitech.wikimedia.org/wiki/MariaDB#Dumping_tables_with_mydumper just to have it there for future references
[13:21:04] love it manuel!
[13:21:13] all my fault for not documenting it
[13:21:29] nope, not your fault you are running 100mph
[13:21:50] maybe we should create a subpage por cloning backups and recovery already?
[13:21:52] Now it is there, poorly documented, but at least the basics
[13:21:54] *for
[13:26:07] yeah, maybe that makes sense
[13:26:16] I would leave it for now
[13:26:50] actually, that may exist already, let me check
[13:27:38] there is https://wikitech.wikimedia.org/wiki/MariaDB/ImportTableSpace
[13:28:15] oh, I see
[13:28:25] you did the long one, and then linked there
[13:28:43] Yeah, I think so
[14:09:14] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3304016 (Reedy) Extension is enabled there, for one reason or another I've run the ALTER TABLE ther...
[14:23:31] btw, db2044, same issue as db2049
[14:23:37] but db2048 worked
[14:31:13] DBA, Operations, ops-codfw: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304126 (jcrespo)
[14:44:38] DBA, Operations, ops-codfw: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304167 (jcrespo) Holding this until the second reinstall fails or succeeds. This is the list of future reimages: https://gerrit.wikimedia.org/r/#/c/356387/ Should we...
[14:47:55] DBA, Operations, ops-codfw: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304179 (jcrespo) a:jcrespo>Papaul Confirmed it fails consistently to boot after install and 99.9% sure that a BIOS upgrade will fix it. Please, papaul help us...
[14:48:20] I suppose you saw
[14:48:21] ^
[14:48:41] will now have a proper look at access logs
[14:48:58] DBA, Operations, ops-codfw: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304126 (MoritzMuehlenhoff) I think we should preemptively upgrade the BIOS on all servers of that order/batch. We can't rule out that some of the symptoms fixed in th...
[14:53:29] jynus: the part that gets stuck is in wmf-reimage (the old bash script) in the sign_puppet function that waits forever for a cert to sign to magically appear on the master
[14:55:53] ok, so that will disappear
[14:56:16] I can quickly add a limit that once reached abort and exit, but how long is long enough?
also if you restart a run while the old one is still stuck, there will surely be race conditions on whether the old process or the new one will pick the cert and sign it
[14:56:26] you can (should) unsubscribe to avoid the rest of the spam
[15:12:20] DBA, MediaWiki-extensions-WikibaseClient, Wikidata, Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3304261 (hoo)
[15:36:57] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304400 (Cmjohnson) replaced the battery with a well used one from a decom'd db. Hopefully this will work for long enough. Server has been powered on
[15:39:38] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304407 (Marostegui) Thanks Chris. The battery is now charging ``` Battery State: Optimal BBU Firmware Status: Charging Status : Charging...
[15:40:19] BBU take N+1, let's see how it goes ;)
[15:40:29] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304409 (Marostegui) Open>Resolved I will mark this as resolved again and let's see how long it lasts ``` root@db1048:~# megacli -ldinfo -l0 -a0 |...
[15:49:07] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304451 (Marostegui) I will revert the DNS patch tomorrow morning once the battery has recharged and all that.
[15:54:03] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (jcrespo)
[15:56:32] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304486 (jcrespo) {P5513}
[16:04:06] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304501 (jcrespo)
[16:05:08] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (jcrespo) Please note that **the ticket is public, but the list of ips is not**, do not take private data from the ips list and copy it here.
[16:13:32] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3304533 (jcrespo) a:MaxBioHazard So according to Reedy, Marostegui and demos, this seems like a...
[16:37:26] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3304647 (Volans)
[17:14:12] DBA, 3d, Multimedia, Patch-For-Review: Have search recognise STL files as a new kind of media file ('type:3d' or whatever) - https://phabricator.wikimedia.org/T157348#3304918 (dr0ptp4kt) p:Normal>High We'll be circling on this one soon.
[17:33:41] DBA, Tracking: Cleanup x1 database connection patterns - https://phabricator.wikimedia.org/T164504#3305049 (Addshore)
[18:11:15] DBA, Operations, Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3305195 (Cmjohnson) @Marostegui is this still an issue?
[18:52:03] DBA, Operations, Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3305474 (Marostegui) @Cmjohnson yes, check the original task description so you can see that there are a few servers that cannot be installed :-(
[19:15:41] DBA, Operations, Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3305582 (jcrespo) This started at exactly 3am and ended at exactly 18:00. All very weird.
[21:05:22] Blocked-on-schema-change, Community-Tech, MediaWiki-Database, Hindi-Sites, and 3 others: Allow comments longer than 255 bytes - https://phabricator.wikimedia.org/T6715#3305976 (Anomie)
[21:05:44] Blocked-on-schema-change, Community-Tech, MediaWiki-Database, Hindi-Sites, and 3 others: Allow comments longer than 255 bytes - https://phabricator.wikimedia.org/T6715#1302388 (Anomie)
[21:06:52] Blocked-on-schema-change, Community-Tech, MediaWiki-Database, Hindi-Sites, and 3 others: Allow comments longer than 255 bytes - https://phabricator.wikimedia.org/T6715#1397647 (Anomie)