[00:23:32] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3120457 (jcrespo) > Is it a good idea to attempt the handling of such problems in the DB layer No, of course it is not. But this is the l...
[06:38:06] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3120803 (Marostegui) db2037 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2037.codfw.wmnet commonswiki -e "show cre...
[06:39:15] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120804 (Marostegui) db2037 is done: ``` root@neodymium:~# mysql...
[07:03:47] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3120812 (Marostegui) And the version is also different: ``` root@s5-master[wikidatawiki]> select version(); +---------------------+ | version() | +---------------------+ | 10.0.22-MariaDB-lo...
[07:04:31] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3120813 (Marostegui) Mmm, looks like dbstore2001 is now giving the correct plan: ``` root@dbstore2001.codfw.wmnet[wikidatawiki]> EXPLAIN DELETE /* Wikibase\Repo\ChangePruner::pruneChanges */ FROM `wb...
[07:25:31] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120835 (Marostegui) I think it has been wiped as it doesn't even show the GRUB after selecting to boot from disk. As I said, the hard disks are being shown in the RAID and BIOS menu. ``` 0 Non-RAID Di...
[07:30:32] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120836 (Marostegui) Also tried to reinstall grub just in case it was the only thing deleted, but also failed on that. So maybe it was indeed reimaged and when I stopped it, was already half way thru...
[07:35:07] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120846 (Marostegui) Pending host in codfw: db2019 (the master),...
[07:52:25] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120884 (Marostegui) The new mainboard is configured to always boot from PXE. ``` System BIOS Settings > Boot Settings > BIOS Boot Settings Boot Sequence...
[08:21:03] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120905 (Marostegui) p:Normal>High a:Marostegui>Papaul Before doing a proper reimage, we need to change the boot sequence to boot first from disk and if not, from the NIC. I am not bei...
[08:37:33] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120914 (Marostegui) @MoritzMuehlenhoff kindly helped and suggested: `racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD` Which I tried, but had no effect on the boot order: ``` /admin1-> r...
[08:44:53] DBA, Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3120920 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1087.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag...
[09:06:40] DBA, Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3120964 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1087.eqiad.wmnet'] ``` and were **ALL** successful.
[09:23:02] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120985 (Marostegui) db1091 is done: ``` root@neodymium:~# mysql...
[09:24:37] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3120987 (Marostegui) codfw is only pending the master. I will do it once I am done with eqiad hosts (which I have started...
[09:57:29] DBA: Defragment: db1091, db1084, db1081 - https://phabricator.wikimedia.org/T161088#3121070 (Marostegui)
[09:58:33] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3121087 (jcrespo) p:Triage>Low I converted it to dynamic row format- whether it was the compression or just the reconstruction is yet to be seen.
[10:03:35] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121113 (jcrespo) Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had with some of the servers in the past. The r...
[10:12:01] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121204 (Marostegui) >>! In T160242#3121113, @jcrespo wrote: > Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had...
[12:07:20] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121607 (jcrespo) Can I reimage the server? https://gerrit.wikimedia.org/r/344108
[12:08:14] jynus: ^ go ahead
[12:08:34] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121612 (Marostegui) Go ahead
[12:09:09] I will reimage it with that config as that should be the workflow
[12:09:15] create then a file on /srv
[12:09:18] reimage again
[12:09:26] with the regular recipe
[12:09:31] the new one, I mean
[12:09:36] and see if it works
[12:09:51] it can be a big win
[12:09:52] if it works fine
[12:10:13] I have no idea what I am doing, we will see
[12:10:18] no more copy data to somewhere else
[12:10:19] hahaha
[12:10:24] well, actually
[12:10:29] that was on purpose
[12:10:38] 1) precise hosts needed repartition
[12:10:53] 2) it is dangerous and unsafe
[12:10:58] see revent db1057
[12:11:01] *recent
[12:11:12] if I hadn't copied it, we would have lost its data
[12:11:17] yeah
[12:11:22] that's true
[12:11:36] chris hasn't poked it yet, no?
[12:11:45] once no more precise hosts (or precise hosts badly upgraded to trusty) are here
[12:11:54] we can consider it on a case by case basis
[12:12:01] yep
[12:12:09] but it should be the default for big big servers
[12:12:16] es, dbstore, sanitariums, etc
[12:12:18] yes
[12:12:34] but again, last time we did it was to wipe old data too
[12:12:46] yes, last time it made total sense
[12:12:46] and a backup was needed anyway- see dbstore1001
[12:13:08] so again, it was not like that on purpose
[12:14:09] once a proper automatic provisioning system is in place
[12:14:14] and everything on jessie
[12:14:19] probably is the way to go
[12:14:26] but backups may still be needed
[12:14:34] yeah, agreed, reinstall / but not srv by default
[12:14:38] and we can always make exceptions
[12:14:54] this is actually not about that
[12:15:01] it is about not wiping data by default
[12:15:18] yes, that is what I meant
[12:15:22] if for some reason it got on install
[12:15:33] a puppet change would be needed to remove /srv
[12:15:53] yes yes, I wasn't clear, with /srv I meant its content
[12:16:18] I am not sure this will work
[12:16:32] but easiest way is to try I think
[12:16:48] and there is nothing to lose on an actual host
[12:16:57] yep, we have a host we can play around with :)
[12:20:56] wmf-auto-reimage -p T160242 es2015.codfw.wmnet
[12:20:57] T160242: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242
[12:21:01] marostegui^
[12:21:03] ok?
[12:21:45] yes
[12:21:46] lokos good
[12:21:50] looks
[12:22:28] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2017032...
[12:25:04] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3121643 (Marostegui) db1084 and labsdb1010 are done: ``` root@neodymium:~# for i in labsdb1010.eqiad.wmnet db1084.eqiad....
[12:26:19] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3121660 (Marostegui) db1084 and labsdb1010 are done: ``` root@neo...
[12:47:44] DBA, Labs, Labs-Infrastructure: Explore 'Analyze' statement as substitute for Explain - https://phabricator.wikimedia.org/T141095#3121693 (jcrespo) https://tools.wmflabs.org/tools-info/optimizer.py no longer works, and that is a problem for users wanting to EXPLAIN their queries.
[13:00:41] DBA, Monitoring, Operations, media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#3121712 (jcrespo)
[13:01:31] DBA, Monitoring, Operations, media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2491529 (jcrespo)
[13:01:54] DBA, Monitoring, Operations, media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (jcrespo)
[13:04:20] DBA, Monitoring, Operations, media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (Marostegui) Good example of a db server where that happens with big alter tables: dbstore2001
[13:17:22] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3121750 (Marostegui) So far so good! ``` root@EVENTLOGGING m4[log]> show tables like 'ImageMetr...
[13:25:54] wmf-reimage failed badly for es2015
[13:26:00] :(
[13:26:03] I had to kill it server
[13:26:04] the partitioner?
[13:26:06] I am doing it manually
[13:26:06] no
[13:26:11] the salt call
[13:26:14] oh
[13:26:20] gets stuck if there is not salt previously
[13:26:33] db1057 was also being installed still
[13:26:35] did you use --new?
[13:26:46] no, because technically, it was not new :-)
[13:27:03] I had to kill db1057 install process on puppetmaster
[13:28:15] last time I had an issue when it got stuck on the salt thing, it was actually the IPMI not working
[13:28:27] it was not that
[13:28:27] I have checked and es2015 isn't here https://phabricator.wikimedia.org/T150160
[13:28:39] IPMI worked fine
[13:28:50] although I had to reboot it manually, too
[13:29:31] I am going to lunch, did you go already?
[13:29:38] i did yep
[13:29:58] can you keep an eye so that es2015 does not page?
[13:30:06] I downtimed it till friday
[13:30:07] in case icinga decides to do so
[13:30:09] yesterday I believe
[13:30:09] yeah
[13:30:11] but
[13:30:12] ah
[13:30:12] yes
[13:30:14] the reinstall XD
[13:30:20] I am not so sure the IPMI is working fine though
[13:30:22] look
[13:30:23] icinga sometimes decides the same service with the same name
[13:30:40] https://phabricator.wikimedia.org/P5105
[13:31:07] it is not the same service, and takes away the downtime, that is why I asked you to keep an eye on it
[13:31:20] marostegui, yeah
[13:31:24] I am not worried about that
[13:31:34] I was planning on restarting this server 10 times
[13:31:40] so we can take that later
[13:31:46] ok :)
[13:31:51] no worries, I will keep an eye on it
[13:32:14] is it now stopped? powered off? reimaging or just waiting for you to act on it?
[13:32:34] it was installed
[13:32:39] ah cool!
[13:32:44] I am checking the install itself was done ok
[13:32:47] ok :)
[13:32:51] go and have lunch then
[13:32:56] ok
[13:33:08] and the next step is to touch a file on /srv
[13:33:19] and reinstall with the new recipe
[13:33:47] ok
[14:14:06] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3121980 (Marostegui) db1081 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb1081 commonswiki -e "show create table im...
[14:14:32] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3121982 (Marostegui) db1081 is done: ``` root@neodymium:~# mysql...
[14:47:23] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122024 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wm...
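[editor's note] The test planned above (touch a file on /srv, reimage, see whether the file is still there) could be scripted roughly as follows. This is only a sketch of the idea, not what was actually run: the function names and the default marker file name are hypothetical, and in the log the file was created by hand.

```python
from pathlib import Path


def place_marker(mount_point: str, name: str = "reimage_marker") -> Path:
    """Create an empty marker file under mount_point before reimaging.

    The default name is a made-up example, not the file used in the log.
    """
    marker = Path(mount_point) / name
    marker.touch(exist_ok=True)
    return marker


def data_survived(marker: Path) -> bool:
    """Return True if the marker still exists after the reimage,
    i.e. the filesystem holding it was not wiped or reformatted."""
    return marker.is_file()
```

A run would call place_marker("/srv") before the reimage and data_survived(...) on the same path once the host is back up.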
[14:49:54] https://phabricator.wikimedia.org/P5107
[14:50:45] yeah
[14:50:48] saw that this morning
[14:50:53] when executing the command
[14:51:17] I guess we'd need to see what they change it for and see if it affects wmf-reimage script if it uses it
[14:51:28] it shouldn't
[14:51:44] but it will affect the "manual" documentation
[14:52:09] I am connected to serial console cross fingers for the recipe
[14:52:28] * marostegui crossing his fingers and his cat's fingers
[14:53:10] no errors
[14:53:28] I didn't see formatting /srv, but that doesn't prove anything
[14:53:39] (only / ext3)
[14:54:15] that is a good sign :)
[15:04:14] it tried to boot from network after install
[15:04:22] bios boot order is wrong
[15:07:09] yes, it is wrong
[15:07:21] I am changing it now
[15:07:27] did it reset for you?
[15:07:38] or did you just boot once with the boot manager
[15:07:49] https://phabricator.wikimedia.org/T160242#3120884
[15:07:56] I tried to change it with no luck
[15:07:58] it never worked
[15:08:08] I believe we need papaul to change it for us
[15:08:17] I was chatting to him around 20 minutes ago or so
[15:08:24] and asked him to hold until you were done with your tests
[15:09:49] I tried changing it on the bios but I couldn't and then I tried with racadm but even though it says "successfully" it wasn't changed
[15:10:25] strip size is 256K
[15:10:28] so that is kept
[15:11:22] well, the raid was never destroyed
[15:11:24] "I tried to change it with no luck"
[15:11:46] so you did the same thing I am doing? putting the disk first and it didn't work?
[15:12:05] So what I did was:
[15:12:31] 1) trying to change the order on the BIOS menu, but I wasn't able to change it - so I wasn't able to change the order of the boot sequence
[15:12:46] I was able
[15:12:47] 2) tried racadm, which says: all done, but once i rebooted, it was still trying PXE first
[15:12:51] Oh really?
[15:12:51] it is booting from C now
[15:12:56] I think I know what it is
[15:13:00] it is not your fault
[15:13:13] which keys did you use? I tried all the normal ones: tab, +,-, space, enter...
[15:13:14] some BIOS require a proper installation to allow the option
[15:13:20] it is not that
[15:13:39] you need an os install, and then bios allows you to select C
[15:13:48] oh really? never seen that before, interesting
[15:14:01] or maybe you didn't press pgup ?
[15:14:06] one of the two
[15:14:10] independently
[15:14:17] say it was the BIOS
[15:14:47] otherwise you will lose respect as the "hardware guy" :-P
[15:14:52] hahahaha
[15:15:00] I never said I was the hardware guy! :p
[15:15:05] I did
[15:15:13] the thing I told you it is true
[15:15:15] so, even now isn't booting from C?
[15:15:17] on some servers
[15:15:24] it has already booted
[15:15:46] on some servers the option to boot from C is only available after server install
[15:15:57] and chris and papaul hate that
[15:16:00] interesting, never seen it
[15:16:04] I now hate that too :p
[15:19:44] not sure the installer is doing anything
[15:19:51] it is stuck on the same place as before
[15:20:09] despite me adding it to salt this time
[15:20:46] last time I added --debug to the installer
[15:20:52] and I got some more info about what was going on
[15:20:59] /srv is not mounted
[15:21:26] I guess puppet hasn't run yet, no? As I cannot log in yet
[15:22:39] the partition with lvm is there
[15:22:51] but I am afraid it may have been recreated and just not formatted
[15:24:00] mount: wrong fs type, bad option, bad superblock on /dev/mapper/tank-data,
[15:24:51] are you able to see the FS on it?
[15:24:53] on not even?
[15:24:56] *or
[15:25:14] no, I think it is recreated but not formatted
[15:25:23] which is what I just told the recipe
[15:25:35] "do not format"
[15:26:19] well, if the lvm is there, then we should be able to mount it no?
[15:27:25] no, it is not the old lvm
[15:28:08] partitioning happened and destroyed all partitions
[15:28:47] right, because /dev/mapper is still /
[15:29:11] mm, but: v_name{ data }
[15:29:57] https://phabricator.wikimedia.org/P5108
[15:30:40] 2017-03-22 14:53:00 is on the reimage
[15:30:50] it just has not been formatted
[15:32:11] ah, i see
[15:32:22] right, so it was indeed created on the reimage
[15:32:26] (as per your comment on the hour)
[15:33:04] # the install makes sure we want to wipe the lvm
[15:35:15] yeah, I am checking again the partman recipe
[15:36:27] https://gerrit.wikimedia.org/r/#/c/344160/
[15:38:23] that should be it
[15:38:31] let's try
[15:38:34] :-/
[15:38:45] I am trying to format the partition
[15:40:15] I've formatted it as ext4
[15:40:24] I am not going to install xfs just to reimage again
[15:40:31] hehe yeah
[15:40:34] did it work fine?
[15:58:44] it enter an infinite loop- which may not be that bad :-)
[15:58:49] *enters
[15:59:02] of getting restarted all the time?
[15:59:04] reimaged
[15:59:14] no, no root partition defined
[15:59:19] ah :(
[15:59:28] but I cannot go and override it manually
[15:59:43] technically, this solves the issue
[16:00:09] but it doesn't get reimaged, no?
[16:00:58] -rw-r--r-- 1 root root 0 Mar 22 15:40 did_pacman_delete_this_file_?
[16:01:03] nope
[16:01:10] it is a horrible thing
[16:01:19] but technically, it works
[16:01:41] haha nice name
[16:02:50] none of the partitions are imaged
[16:02:50] I wonder if the boot order thingy is the thing affecting the reimage
[16:02:56] oh
[16:03:17] so I would continue trying to get the recipe fixed
[16:03:20] don't get me wrong
[16:03:30] but I think like this is better than before
[16:03:44] yeah, at least it doesn't wipe the data
[16:03:45] so when you want to reimage, you have to add db.cfg manually
[16:03:54] as I did on the previous patch
[16:03:57] I know it is horrible
[16:03:58] yeah
[16:04:01] Well, it is safe
[16:04:11] but this will get eventually fixed
[16:04:22] and if a large reimage happens, only 1 commit is needed
[16:04:34] happens to be needed
[16:05:00] I will setup a ticket to create a proper recipe
[16:05:12] sounds good
[16:05:22] but not with high priority
[16:05:23] however, we still need papaul to change the order or not needed anymore?
[16:05:31] I think I fixed that
[16:05:40] so we can attempt a normal reimage?
[16:05:43] (changing the db.cfg)
[16:05:44] the ipmi still doesn't work
[16:05:51] no need really
[16:05:57] it works now well
[16:06:04] it just needs to run puppet
[16:06:05] so, if you reboot it manually, it boots from disk again?
[16:06:11] yes
[16:06:14] but let me try it
[16:06:17] once more
[16:06:27] what is broken is the installer on this host
[16:06:31] to be honest, I am glad we "found" this yesterday and not sometime while codfw was active
[16:06:41] so ipmi
[16:06:46] and the script+salt
[16:06:49] we can try papaul to reseat it for us
[16:07:24] there is an option on the bios, too
[16:07:29] we can ask chris
[16:08:29] I am talking to papaul now, I can ask him to see if he can reset it for us
[16:08:32] he is onsite
[16:08:40] it boots just right to disk
[16:08:46] let me put it down
[16:08:48] ok
[16:10:42] I am going to leave the server on the bios menu
[16:10:54] ok
[16:11:21] papaul will reset it for us in a bit
[16:11:24] I think chris mentioned a BIOS option that enabled or disabled it
[16:11:29] reset what?
[16:12:13] sometimes idrac has issues and you need to leave the server without ANY power for a bit
[16:12:28] ok
[16:12:47] other than that, maybe we can start recloning it from es2014 tomorrow?
[16:13:21] can you logoff if you are?
[16:13:24] (from the idrac)
[16:13:41] I have
[16:14:42] thanks
[16:16:44] looks like IPMI was disabled
[16:17:52] works now
[16:17:53] root@neodymium:~# ipmitool -I lanplus -H es2015.mgmt.codfw.wmnet -U root -E chassis power status
[16:17:57] Unable to read password from environment
[16:17:59] Password:
[16:18:02] Chassis Power is on
[16:23:19] I am running puppet on the server
[16:35:01] dbstore1001 is you? aka alter
[16:35:51] ah
[16:35:52] no
[16:35:55] it is the dumps
[16:35:58] :)
[16:36:04] which I think they are starting now
[16:37:56] so es2015 looks set up now
[16:38:10] puppet run and all?
[16:38:14] yep
[16:38:17] salt and all that
[16:38:50] maybe we can depool es2014 and start the transfer
[16:38:57] wait
[16:39:19] oh, you did reformat /srv already
[16:39:22] yeah
[16:39:25] I assumed you were done
[16:39:38] yes
[16:39:45] I just wanted to put it back
[16:39:59] my motto is that if I broke it I fix it
[16:40:03] ah
[16:40:05] and I broke /srv
[16:40:12] wanted to put it back
[16:40:12] haha I broke the whole server!
[16:40:17] by rebooting it
[16:40:18] you didn't
[16:40:38] although I told you to stop working late :-)
[16:40:42] I remember that
[16:40:48] that is true!
[16:40:52] i actually remembered that
[16:40:54] so probably you should do that right now
[16:41:07] I've just woken up
[16:41:11] haha
[16:41:16] are you in pacific zone again?
[16:42:50] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122315 (Marostegui) p:High>Normal a:Papaul>None
[16:44:12] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (Marostegui) The server is now set up, and ready to get the data from es2014. Things that have been done: - Tested a new way to prevent a server to avoid wiping the part...
[16:44:57] jynus: I think I will take your advice now and logoff, if you are fine with it, maybe you can start the transfer from es2014 to it
[16:45:06] I was going to
[16:45:08] if not, I can do that tomorrow morning
[16:45:10] Ah :)
[16:45:24] great!
[16:45:41] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122330 (Marostegui) a:jcrespo
[16:45:57] I am going to do groceries then
[16:46:04] thanks for all the help
[16:46:43] db1068 and labsdb1011 still running the alter table fyi
[17:03:36] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122398 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['es2015.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['es2015.codfw.wmnet']) ```
[17:06:00] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122421 (jcrespo) I am going to use the codfw master **es2016** not es2014, because the latter does have compressed tables- something we have yet to fix, and not something we wan...
[17:17:10] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122439 (jcrespo) I've started the transfer from es2016 to es2015, the transfer may take 11-12 hours, so it will finish by ~6-7 UTC. es2016 and es2015 will be down during it. I h...
[17:53:52] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122497 (MusikAnimal) @jcrespo Finally got around to this, below are my results. This test query was run on my local Vagrant...
[17:56:13] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122503 (jcrespo) You have to run the query (not the explain). ``` FLUSH STATUS; SELECT ...; SHOW STATUS like 'Hand%'; ```...
[17:59:38] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122519 (MusikAnimal) >>! In T156318#3122503, @jcrespo wrote: > You have to run the query (not the explain). > ``` > FLUSH...
[18:02:09] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122541 (jcrespo) Then it is the rows- running that on 4/9 rows is not useful (there are 1000 million revision rows only on...
[18:10:10] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122655 (jcrespo) https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=es2015&var-network=eth0&from=1490198400000&to=now
[18:21:11] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122730 (MusikAnimal) >>! In T156318#3122541, @jcrespo wrote: > Then it is the rows- running that on 4/9 rows is not useful...
[18:27:40] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122765 (jcrespo) Let me create a test table with you somewhere- but please give me a one-liner to set it up.- E.g. ``` CRE...
[18:36:05] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122776 (MusikAnimal) >>! In T156318#3122765, @jcrespo wrote: > Let me create a test table with you somewhere- but please gi...
[18:50:01] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122811 (jcrespo) @MusikAnimal that is very difficult. I need to disable the binary logs and use SUPER to write to a read-on...
[18:52:06] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122817 (jcrespo) For example: ``` root@db2048[enwiki]> SELECT HEX(INET_ATON('192.0.0.1')); +-----------------------------+...
[19:06:15] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122881 (MusikAnimal) @jcrespo Nice! It looks like `SELECT HEX(INET6_ATON('192.168.0.1'));` does exactly what we want, but i...
[19:16:43] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122899 (jcrespo) Thanks, I may be going soon, but I will try to execute that and see what we get- and have it ready for tom...
[19:44:20] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3123004 (jcrespo) It took less time than I thought- you can check the table now.
[19:55:17] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3123050 (jcrespo) Open>Resolved a:jcrespo My alter seemed to fix the issue. {F6865546} However, that table is no longer compressed. Probably it is due to the rebuild, not the compression,...
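[editor's note] The `HEX(INET6_ATON(...))` conversion mentioned in T156318 above can be reproduced outside the database for quick checks. The sketch below uses only Python's standard `ipaddress` module; the function names are hypothetical, and the idea of comparing hex bounds for a CIDR range is an assumption about how such values might be used, not something stated in the log.

```python
import ipaddress


def inet6_aton_hex(address: str) -> str:
    """Hex of the packed address, like HEX(INET6_ATON(address)):
    8 hex digits for IPv4, 32 for IPv6, uppercase."""
    return ipaddress.ip_address(address).packed.hex().upper()


def range_bounds_hex(cidr: str) -> tuple:
    """First and last address of a CIDR range, as hex strings.

    Assumption: a range query could compare a stored hex column
    against these two bounds.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    return (
        network.network_address.packed.hex().upper(),
        network.broadcast_address.packed.hex().upper(),
    )
```

For example, inet6_aton_hex("192.168.0.1") gives "C0A80001", and range_bounds_hex("192.168.0.0/16") gives ("C0A80000", "C0A8FFFF").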
[19:59:27] DBA, Operations, ops-eqiad, Patch-For-Review: db1094 crash - https://phabricator.wikimedia.org/T160832#3123061 (jcrespo) Open>Resolved a:jcrespo Resolved- we have to contact the vendor if it happens any other time.
[20:03:51] DBA, Operations, Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3123076 (jcrespo) p:Triage>Low
[20:05:49] DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3123104 (jcrespo)
[20:05:52] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3123103 (jcrespo) Open>Resolved
[21:18:36] DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (thcipriani)
[21:35:36] DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (hashar) [[ https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=codfw%20prometheus%2Fops&var-server=es2016&from=now-12h&to=now 12 hours view of prometh...
[21:39:04] DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123361 (chasemp) seems that way, I didn't see that sal and texted @Marostegui to ask (sorry buddy!)
[21:49:54] DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (bd808) These all seem to be requests for enwiki [[Main Page]] on codfw app and api servers which end up trying to fetch revision content from es2016.codfw...
[21:53:30] DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123388 (bd808) {F6869248} Requests are coming from einsteinium.wikimedia.org and tegmen.wikimedia.org and appear likely to be icinga checks. h/t @EBernhardson
[22:31:38] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123543 (yuvipanda) a:yuvipanda>None
[22:31:53] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123552 (yuvipanda) This was done, and @madhuvishy just made it work...
[22:32:04] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123554 (yuvipanda) a:madhuvishy