[05:28:44] DBA, Phabricator, Release-Engineering-Team, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715369 (MZMcBride) >>! In T146673#2704843, @jcrespo wrote: > * Does InnoDB FULLTEXT respec...
[07:19:00] DBA, Phabricator, Release-Engineering-Team, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715492 (Paladox) That task is https://secure.phabricator.com/T10642
[07:20:16] DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2715494 (Marostegui) S7 master `db1041` got its tables removed from the following wikis ``` for i in `cat s7`; do echo $i; mysql -hdb1041 $i -e "set session sql_log_bin=0; drop table...
[07:41:31] DBA, Phabricator, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715503 (hashar) Dropping the #releng tag to limit inbox filling. The task is already in #phabricator and has appropriate...
[07:50:38] DBA, Phabricator, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715508 (Paladox) Looks like mysql has stemming if you do http://oksoft.blogspot.co.uk/2009/04/innodb-disk-fragmentation...
[08:14:21] DBA, Phabricator, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715535 (jcrespo) Open>Resolved a:jcrespo @Paladox The first 2 links are unrelated. The third is a 7 year old...
[08:43:03] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715572 (Marostegui) I have upgraded to 10.1.18. The first I had to deal with was: ``` ERROR 1275 (HY000): Server is running in --secure-auth mode, but 'root'@'localhost' has a password in the old format;...
[08:47:11] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715576 (jcrespo) > cp mysql /usr/local/bin/ That is done by the package itself, that is very strange. Let me implement https://gerrit.wikimedia.org/r/315228 on 10.1- it should work.
[08:52:32] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715594 (jcrespo) Oh, I see what happens- on install, symbolic links are created on /usr/local, but if you delete a package, those have to be deleted. The solution is to make the packages incompatible, but th...
[08:57:58] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715596 (Marostegui) >>! In T146261#2715594, @jcrespo wrote: > Oh, I see what happens- on install, symbolic links are created on /usr/local, but if you delete a package, those have to be deleted. The solution...
[09:19:32] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2655213 (MoritzMuehlenhoff) Not sure if it's a temporary thing or warrants the work, but the Debian way to manage multiple packages owning a common command is https://wiki.debian.org/DebianAlternatives Let m...
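A minimal sketch of the Debian alternatives approach suggested at 09:19:32, for readers unfamiliar with it. The `/opt/...` install prefixes, the `wmf-mariadb101` name, the link location and the priorities are assumptions for illustration, not the layout of the real packages; `DRY_RUN` and the helper functions are hypothetical as well. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: prefixes, link name and priorities are illustrative
# assumptions, not the actual package layout discussed in the log.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Register one package's mysql client as a candidate for the shared link.
# Usage: register_mysql <install-prefix> <priority>
register_mysql() {
    run update-alternatives --install /usr/local/bin/mysql mysql "$1/bin/mysql" "$2"
}

# Unregister it again, for example from a package removal script.
# Usage: unregister_mysql <install-prefix>
unregister_mysql() {
    run update-alternatives --remove mysql "$1/bin/mysql"
}

# Two packages provide the same command; in automatic mode the candidate
# with the higher priority owns the link.
register_mysql /opt/wmf-mariadb10 100
register_mysql /opt/wmf-mariadb101 101

# Removing one candidate lets the link fall back to the remaining one.
unregister_mysql /opt/wmf-mariadb101
```

Run with `DRY_RUN=0` as root to apply the changes. Whether this coexists with symlinks that older packages created by hand is not covered by the sketch.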
[09:40:32] One of the things I do not enjoy about MariaDB GTID being different from (and incompatible with) MySQL's is that MariaDB's documentation about multisource+gtid is really poor and you cannot really use MySQL's
[09:40:36] :(
[09:45:39] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715731 (Marostegui) Thanks @MoritzMuehlenhoff that would actually fix the problem I believe as we can set the priority for the latest package installed and then remove without any problem the new mariadbwmf1...
[09:59:00] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715766 (MoritzMuehlenhoff) The alternatives system does that even automatically in the case of only two alternatives, so if you have wmf100 and wmf101 providing the symlink, and wmf100 gets removed, it will...
[10:01:38] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715769 (jcrespo) I am compiling the new packages with update-alternatives as we speak. However, that may break the existing old hardcoded symlinks.
[10:23:38] marostegui, around?
[10:23:43] yes
[10:23:56] I am uploading the new packages
[10:24:08] if there are no other issues, I will upload them to the new repo
[10:24:26] just a heads up I will be stopping mysql, in case you are doing something with it
[10:24:55] I am talking about dbstore2001
[10:25:16] Ah sure
[10:25:18] No worries
[10:25:29] I am tailing the log so I will see when you restart it
[10:25:31] weren't you in the middle of an import?
[10:25:35] No no, not yet
[10:25:39] ok, thanks
[10:25:41] that was all
[10:25:42] I was testing some multisource
[10:25:45] but it is all sorted now
[10:25:47] so feel free to restart
[10:25:51] I will ping you when finished
[10:26:08] the packages are crude, but hopefully working
[10:26:22] I also want to test systemd with 10.1
[10:29:19] once I test this, I will test things on another host, so we do not collide
[10:34:12] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2715861 (jcrespo) This is now the procedure- it is not very elegant, but it works: ``` root@dbstore2001:/opt$ sudo dpkg -i ~/wmf-mariadb10_10.0.27-1_amd64.deb Selecting previously unselected package wmf-mari...
[10:49:48] jynus: \o/
[10:49:56] * marostegui is watching jynus
[10:50:21] however, the systemd problem is not a 10.1 problem, it is a compilation issue
[10:50:59] I think we should stay with compatibility mode
[10:51:09] until all servers are in jessie
[10:51:41] yeah, that is probably the safest thing to do
[10:51:56] We are not getting any real benefit at this point from switching to systemd now?
[10:52:09] well, the issue is that there is poor support
[10:52:24] and even jessie does not support it on mariadb
[10:52:52] so doing it ourselves will be a pain, especially with multiple packages and versions
[10:53:06] and mysql support is different than mariadb
[10:53:19] should I start the slave?
[10:53:22] marostegui, ^
[10:53:31] yeah
[10:53:37] s3 one
[10:53:42] (there is no other slave)
[10:53:44] there you go
[10:53:58] it enables parallel replication by default
[10:54:04] I am a bit worried about that
[10:54:12] it caused issues in the past
[10:54:37] but let's give it a try
[10:55:06] I am going to upload the package now
[10:55:12] cool thanks
[10:55:23] but we will not update any server to it
[10:55:29] no no
[10:56:00] especially until I provide a systemd unit or probably, an init file
[10:56:07] on puppet
[10:57:12] I will leave dbstore alone for you
[10:57:33] did chris move one to be reimaged?
[10:57:41] I would like to test there
[10:57:47] db1053 you mean?
[10:58:10] db1053 is now part of s1 but will be moved to s4 to replace db1019
[10:58:13] so it needs to be reimaged
[10:58:57] yes, I will take control of that
[10:59:02] oki
[10:59:03] for testing
[10:59:07] sounds good
[10:59:11] so I do not interfere with the imports
[11:00:10] ok, thanks :)
[11:00:55] tendril: "dbstore2001 10.192.0.32 10.1.18 32G 10m 4s 1 Yes 0s" :-)
[11:01:16] DBA, Phabricator, Patch-For-Review, Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2715893 (Aklapper) >>! In T146673#2715508, @Paladox wrote: > Looks like mysql has stemming if you do @paladox: Please st...
[11:01:26] \o/
[11:25:56] DBA, Labs, Labs-Infrastructure, MassMessage, and 3 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2715913 (Marostegui) Looks like MariaDB assigned the bug to someone already (after me asking for an update: ``` Elena Stepanova reassigned MDEV-...
[11:30:08] DBA, Labs, Labs-Infrastructure, MassMessage, and 3 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2715915 (jcrespo) I think we specifically disabled parallel replication?
Maybe multisource is the cause? In which case it is still a bug for me.
[11:35:38] DBA, Labs, Labs-Infrastructure, MassMessage, and 3 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2715921 (Marostegui) Yes - I am unsure why she said so: ``` MariaDB SANITARIUM localhost (none) > show global variables like 'slave_parallel_mo...
[11:36:59] I may reimage db1053 later
[11:37:23] it is trusty and I need jessie
[11:37:25] ok :)
[12:28:25] no ticket for db1035? Maybe T147305 ?
[12:28:26] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[12:28:39] db1035?
[12:28:54] arg
[12:28:57] db1053
[12:28:59] Ah
[12:29:34] maybe the db1019 decom
[12:29:47] No, we didn't create one you are right. We created the db1019 one and spoke about db1053 replacing it, but I am not sure if we updated a ticket with it
[12:30:09] I will do it as part of T147309
[12:30:10] T147309: Decommission db1019 - https://phabricator.wikimedia.org/T147309
[12:30:12] https://phabricator.wikimedia.org/T147774 this was only for DCOps
[12:30:20] but that is for dc-ops
[12:30:21] Sounds good
[12:30:28] I do not want to make that more difficult
[12:30:46] I will do it as T147305, as technically it is part of commons
[12:31:10] ok
[12:31:20] db1019 is still waiting for Chris to finish his part
[12:32:14] yes, last thing I want is to confuse other ops
[12:32:26] yeah, no way
[12:32:48] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2688288 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1053.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610141232_jynus...
[12:35:11] I think it is going to fail
[12:38:17] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2716254 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1053.eqiad.wmnet'] ``` Those hosts were successful: ``` [] ```
[12:38:23] yep
[12:40:29] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2716270 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1053.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610141240_jynus...
[12:40:32] Did it hang in icinga?
[12:40:40] because it took 5 minutes as per the logs
[12:40:40] yep, I am removing the downtime
[12:40:42] to move forward
[12:40:43] ahd
[12:40:49] Yeah, I got that remember?
[12:40:55] but that is a bug
[12:41:06] Yeah, maybe we need to pay a visit to volans :p
[12:41:46] T145192
[12:41:46] T145192: icinga-downtime script waiting forever if host already in downtime - https://phabricator.wikimedia.org/T145192
[12:51:05] https://phabricator.wikimedia.org/p/jcrespo/ awarded a token. XDDDDDD
[12:51:15] in that ticket XD
[12:51:49] you're trolling me guys
[12:51:56] it is friday!
[12:52:39] marostegui, you are wrong- every day is trolling day
[12:55:43] trotfl
[13:06:11] we need to speed up the dbstore disks decision
[13:06:29] they were supposed to go through last quarter
[13:06:35] we are already late
[13:06:36] jynus: For me 12x2TB sounds good
[13:06:38] no?
[13:06:58] I would rather go for 12x2TB in both instead of one with 12x2TB and another one 12x3TB
[13:07:18] does that fit our predictions?
[13:07:28] I cannot remember them
[13:07:33] I am reading them and look at this
[13:08:02] 12x2TB disks would give us 11.7TB usable space (instead of 6.5T)
[13:08:02] • Usable life: 2-3 years
[13:08:15] 16x1.8 TB - 14.4TB of usable space (instead of 6.5)
[13:08:15] • Usable life: 3-5 years
[13:08:43] Rob said:
[13:08:43] I think there may be some confusion, based from stating just go with the 12 disks. The two systems aren't identical, so their upgrades paths will differ.
[13:08:47] So the cheapest option to upgrade dbstore1001 is the 3TB * 12.
[13:09:13] I do not understand that
[13:09:19] yeah, me neither
[13:09:25] I will ask for clarification
[13:09:26] we should ask him
[13:09:33] I will do it now
[13:09:56] he is up later
[13:10:01] pacific
[13:10:08] I meant on the ticket :)
[13:11:33] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2716544 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1053.eqiad.wmnet'] ``` Those hosts were successful: ``` ['db1053.eqiad.wmnet'] ```
[13:11:49] \o/
[13:13:06] given that 1 year ago this was a multi-step process that took hours and multiple manual steps, I have faith for the future
[13:14:13] jynus: A year ago the process was the same one I did with you during my first week?
[13:14:39] no, worse
[13:14:55] I had to connect to the serial console and say yes a couple of times due to a bug
[13:14:59] :-)
[13:15:07] hahaha
[13:16:43] https://phabricator.wikimedia.org/P4221
[13:17:42] moritzm, it has dependencies and everything!
[13:17:48] haha
[13:17:56] It is not yet a real package, but it almost looks like one
[13:18:00] :)
[13:18:02] well done!
[13:19:15] nice!
[13:57:07] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2716783 (Marostegui) Importing the tablespace failed for one table (enwiki/change_tag`and crashed the whole server.)
``` 2016-10-14 13:47:38 140429149276928 [Note] InnoDB: Sync to disk 2016-10-14 13:47:38 14...
[13:57:17] jynus: In which state did you leave dbstore2001? Should I start MySQL with /etc/init.d/mysql start?
[13:58:39] I am asking because of this: https://phabricator.wikimedia.org/P4223
[13:58:46] it should be up
[13:59:00] yeah, I scrapped the systemd
[13:59:12] Yeah, it was up until it crashed when importing the tables, so I need to do some troubleshooting and want to bring it back up
[13:59:14] continue using init.d for now, I am creating a patch
[13:59:25] it is up
[13:59:43] crashed when importing the tables?
[13:59:54] Yep
[13:59:55] I asked you if you were using it
[14:00:00] you said no
[14:00:01] No no
[14:00:11] I started to import once you were finished :)
[14:00:11] this was not a crash, but a regular shutdown
[14:00:21] oh, sorry
[14:00:23] I see now
[14:00:25] Sorry, I think I didn't communicate correctly
[14:00:27] Ah yeah :)
[14:00:48] yeah, what it needs is the basedir
[14:00:51] dear marostegui and jynus :) do you have time to throw some wikis on labsdb1008 today? and if I could specifically get olowiki and if I could get adywiki/jamwiki that would be excellent
[14:00:52] yeah
[14:01:00] Yeah, so I will fix that
[14:01:05] chasemp, it was on my todo
[14:01:06] Just making sure you did that yourself
[14:01:24] I will do it now after mark lets me go with the presentation
[14:01:45] jynus: Is that a mysqldump and move it to labsdb1008?
[14:01:47] jynus: thanks man! specific call out for olowiki as the newest actual needs to be run in prod so I can test w/ it and adywiki/jamwiki as the most recent victims of the old version so I can do a bit of compare
[14:01:58] ok
[14:02:07] choose a list, put them somewhere
[14:02:10] chasemp,
[14:02:33] marostegui, yes
[14:02:42] from an existing labsdb
[14:03:45] or db1069 then drop all views or triggers
[14:03:45] labsdb1008 is also not replicating, so no need to grab the slave position or something right?
[14:03:45] ask chase if they need replication
[14:03:45] or just the tables, static
[14:03:47] A point in time view should be fine
[14:03:55] as long as the schema is current etc
[14:04:55] so I did a bit of revision https://gerrit.wikimedia.org/r/#/c/315534/, madhu is reviewing currently
[14:11:06] chasemp: olowiki is now at labsdb1008 for you
[14:11:26] sweet, that was quick
[14:11:39] I took it from labsdb1001
[14:16:14] yes, the small ones are easy
[14:16:19] it is just starting to do it
[14:16:28] which I have been starting for 2 weeks
[14:16:44] chasemp: jam and adywiki are also added now
[14:17:01] :-(
[14:17:11] cool, that should give me some context thanks guys
[14:17:17] jynus: no worries man!
[14:17:31] thanks for cc'ing me on the other task w/ mariadb 10.1 stuff too
[14:17:32] jynus: That is why I am here! To try to offload you a bit :)
[14:17:41] following along but nothing of value to contribute atm :)
[14:17:59] we are testing 10.1 now
[14:18:11] we want to be sure it is stable
[14:18:34] we have to later check the specifics for labs: roles and row-based triggers
[14:20:07] yep
[14:28:19] I think I added you to "Reimage dbstore2001 as jessie"- that is where the hot things about the goal are happening
[14:45:59] jynus: I might be missing something obvious, but after changing the basedir for the init.d/mysql it is still complaining and not starting.
https://phabricator.wikimedia.org/P4223
[14:46:08] For now I have started it manually as I want to do recovery first
[14:46:17] I will spend time on the script later, but just FYI
[15:20:46] so the problem is that I added a systemd unit for testing, and now it wants to manage all settings
[15:21:33] haha
[15:22:10] even if I disabled it afterwards
[15:24:55] systemd unit has become self aware
[15:27:57] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2716936 (Marostegui) I started mysql with innodb_force_recovery = 1 to keep trying to import the tablespaces from the table it failed. I was able to keep advancing on the tables, but it crashed again. ```...
[15:36:31] you should not try to import tables to a different version server
[15:36:39] go for 10.0
[15:36:48] I do not know why you chose 10.1
[15:37:09] To try :)
[15:37:21] it is ok to try
[15:37:27] but we have a goal to meet
[15:37:59] Yeah, I know but given that we had 10.1 installed and there, I thought I would try
[15:38:06] no
[15:38:07] But no worries, I will downgrade and try same versions
[15:38:20] I do not think you will be able to downgrade
[15:38:37] I am mostly sure you will have to start from 0
[15:38:52] which is the part I do not like about the testing, too much wasted time
[15:39:37] go 10, and then if you want, do a snapshot and upgrade to 10.1
[15:40:08] we probably have 5.5 format tables
[15:40:14] even on 10
[15:40:18] which may cause issues
[15:40:21] Yep, I will install the 10.x package and copy the data over again that is the longest part
[15:40:47] that is why I suggested keeping an offline copy
[15:41:20] you should have now plenty of space
[15:41:55] yeah, there is plenty indeed
[15:42:39] regarding systemd, I will double check how to disable it for good :p
[15:42:48] So it doesn't mess with the init.d script
[15:43:21] it is environment variables
[15:43:36] you can use "systemctl mask mysql.service" (or whatever it's called)
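A small sketch of the masking suggestion above. The unit name `mysql.service` is the guess made in the log itself and may differ on the host; `DRY_RUN`, `UNIT` and the helper functions are hypothetical names used only for this illustration. Masking links the unit to /dev/null so it cannot be started until it is unmasked; the log does not confirm whether this resolved the init.d problem.

```shell
#!/bin/sh
# Sketch only: the unit name is an assumption taken from the chat
# ("mysql.service (or whatever it's called)").
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
UNIT="${UNIT:-mysql.service}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the unit, then mask it so systemd refuses to start it again.
mask_unit() {
    run systemctl stop "$1"
    run systemctl mask "$1"
}

# Reverse the mask once a proper unit or init file is in place.
unmask_unit() {
    run systemctl unmask "$1"
}

mask_unit "$UNIT"
```

After a real run, `systemctl is-enabled` on the unit should report `masked`.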
[15:44:08] ah nice - I haven't played a lot with systemd myself so this is good to know!
[15:44:59] jynus: Was I correct in reading (somewhere?) that modules/snapshot in puppet is going away?
[15:45:24] I did not touch such a thing
[15:45:39] Ok, must've confused with something else, nvm :)
[15:46:44] there is some problem with the binary
[15:47:01] for some reason, it doesn't daemonize correctly
[15:47:25] I thought it was systemd, but it happens with init.d, too
[15:47:40] Oh really?
[15:47:49] 10.1?
[15:47:58] all versions
[15:48:04] :|
[15:48:18] maybe it is the custom-deployed mysqld_safe ?
[15:53:23] no, they are exactly the same file
[15:55:50] what is the problem you've observed?
[15:56:20] mysql doesn't daemonize itself
[15:58:27] Not really related but interesting: https://jira.mariadb.org/browse/MDEV-11046
[15:58:43] I was quickly going through their KB to see if there was anything mentioned that could be similar to what you are seeing
[15:59:38] I think I got it
[15:59:56] tell me!
[16:00:03] line break
[16:00:08] for the &
[16:00:15] :|
[16:04:10] but if I fix that, now the server times out?
[16:04:42] as in, runs normally, but the script does not detect it
[16:05:01] that is weird
[16:05:15] it has its pid
[16:05:18] which box are you testing in, db1053?
[16:05:19] its socket
[16:05:22] I can connect
[16:05:36] the log is right, just after connections are allowed
[16:05:48] but the init does not stop
[16:07:03] "if $bindir/mysqladmin ping >/dev/null 2>&1; then"
[16:07:08] I do not think that will work
[16:07:23] and I am using the upstream init file
[16:31:13] the packages are wrong
[20:19:24] DBA, Community-Tech-Sprint: Create a maintenance script for populating the local_user_id and global_user_id fields in the centralauth localuser table - https://phabricator.wikimedia.org/T142503#2537232 (Mattflaschen-WMF)
[20:25:23] Blocked-on-schema-change, Community-Tech, Patch-For-Review, Schema-change: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951#2517600 (Mattflaschen-WMF) > It has been applied and tested on Beta Cluster. Doesn't seem...
[21:37:16] DBA, Beta-Cluster-Infrastructure, Operations: Possible to run writes (e.g. UPDATE) on slave - https://phabricator.wikimedia.org/T110115#2718307 (Mattflaschen-WMF) This caused complications when trying to fix {T148111} too. @bd808 accidentally ran the ALTER on the slave, I then ran it on master, but...
[21:37:28] DBA, Beta-Cluster-Infrastructure, Operations: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115#2718311 (Mattflaschen-WMF)