[00:23:32] <wikibugs_>	 10DBA, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3120457 (10jcrespo) > Is it a good idea to attempt the handling of such problems in the DB layer  No, of course it is not. But this is the l...
[06:38:06] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 13Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3120803 (10Marostegui) db2037 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2037.codfw.wmnet commonswiki -e "show cre...
[06:39:15] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 06Multimedia, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120804 (10Marostegui) db2037 is done: ``` root@neodymium:~# mysql...
[07:03:47] <wikibugs_>	 10DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3120812 (10Marostegui) And the version is also different: ``` root@s5-master[wikidatawiki]> select version(); +---------------------+ | version()           | +---------------------+ | 10.0.22-MariaDB-lo...
[07:04:31] <wikibugs_>	 10DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3120813 (10Marostegui) Mmm, looks like dbstore2001 is now giving the correct plan: ``` root@dbstore2001.codfw.wmnet[wikidatawiki]> EXPLAIN DELETE /* Wikibase\Repo\ChangePruner::pruneChanges  */ FROM `wb...
[07:25:31] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120835 (10Marostegui) I think it has been wiped as it doesn't even show the GRUB after selecting to boot from disk. As I said, the hard disks are being shown in the RAID and BIOS menu. ``` 0 Non-RAID Di...
[07:30:32] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120836 (10Marostegui) Also tried to reinstall grub just in case it was the only thing deleted, but also failed on that. So maybe it was indeed reimaged and when I stopped it, was already half way thru...
[07:35:07] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 06Multimedia, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120846 (10Marostegui) Pending host in codfw: db2019 (the master),...
[07:52:25] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120884 (10Marostegui)  The new mainboard is configured to always boot from PXE. ```   System BIOS Settings > Boot Settings > BIOS Boot Settings    Boot Sequence...
[08:21:03] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120905 (10Marostegui) p:05Normal>03High a:05Marostegui>03Papaul Before doing a proper reimage, we need to change the boot sequence to boot first from disk and if not, from the NIC. I am not bei...
[08:37:33] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120914 (10Marostegui) @MoritzMuehlenhoff kindly helped and suggested: `racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD` Which I tried, but it had no effect on the boot order: ``` /admin1-> r...
[08:44:53] <wikibugs_>	 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3120920 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1087.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag...
[09:06:40] <wikibugs_>	 10DBA, 13Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3120964 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1087.eqiad.wmnet'] ```  and were **ALL** successful.
[09:23:02] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 06Multimedia, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3120985 (10Marostegui) db1091 is done: ``` root@neodymium:~# mysql...
[09:24:37] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 13Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3120987 (10Marostegui) codfw is only pending the master. I will do it once I am done with eqiad hosts (which I have started...
[09:57:29] <wikibugs_>	 10DBA: Defragment: db1091, db1084, db1081 - https://phabricator.wikimedia.org/T161088#3121070 (10Marostegui)
[09:58:33] <wikibugs_>	 10DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3121087 (10jcrespo) p:05Triage>03Low I converted it to dynamic row format- whether it was the compression or just the reconstruction is yet to be seen.
[10:03:35] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121113 (10jcrespo) Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had with some of the servers in the past.  The r...
[10:12:01] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121204 (10Marostegui) >>! In T160242#3121113, @jcrespo wrote: > Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had...
[12:07:20] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121607 (10jcrespo) Can I reimage the server? https://gerrit.wikimedia.org/r/344108
[12:08:14] <marostegui>	 jynus: ^ go ahead
[12:08:34] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121612 (10Marostegui) Go ahead
[12:09:09] <jynus>	 I will reimage it with that config as that should be the workflow
[12:09:15] <jynus>	 create then a file on /srv
[12:09:18] <jynus>	 reimage again
[12:09:26] <jynus>	 with the regular recipe
[12:09:31] <jynus>	 the new one, I mean
[12:09:36] <jynus>	 and see if it works
[12:09:51] <marostegui>	 it can be a big win
[12:09:52] <marostegui>	 if it works fine
[12:10:13] <jynus>	 I have no idea what I am doing, we will see
[12:10:18] <marostegui>	 no more copy data to somewhere else 
[12:10:19] <marostegui>	 hahaha
[12:10:24] <jynus>	 well, actually
[12:10:29] <jynus>	 that was on purpose
[12:10:38] <jynus>	 1) precise hosts needed repartition
[12:10:53] <jynus>	 2) it is dangerous and unsafe
[12:10:58] <jynus>	 see revent db1057
[12:11:01] <jynus>	 *recent
[12:11:12] <jynus>	 if I hadn't copied it, we would have lost its data
[12:11:17] <marostegui>	 yeah
[12:11:22] <marostegui>	 that's true
[12:11:36] <marostegui>	 chris hasn't poked it yet, no?
[12:11:45] <jynus>	 once no more precise hosts (or precise hosts badly upgraded to trusty) are here
[12:11:54] <jynus>	 we can consider it on a case by case basis
[12:12:01] <marostegui>	 yep
[12:12:09] <marostegui>	 but it should be the default for big big servers
[12:12:16] <marostegui>	 es, dbstore, sanitariums, etc
[12:12:18] <jynus>	 yes
[12:12:34] <jynus>	 but again, last time we did it was to wipe old data too
[12:12:46] <marostegui>	 yes, last time it made total sense
[12:12:46] <jynus>	 and a backup was needed anyway- see dbstore1001
[12:13:08] <jynus>	 so again, it was not like that on purpose
[12:14:09] <jynus>	 once a proper automatic provisioning system is in place
[12:14:14] <jynus>	 and everything is on jessie
[12:14:19] <jynus>	 it is probably the way to go
[12:14:26] <jynus>	 but backups may still be needed
[12:14:34] <marostegui>	 yeah, agreed, reinstall / but not srv by default
[12:14:38] <marostegui>	 and we can always make exceptions
[12:14:54] <jynus>	 this is actually not about that
[12:15:01] <jynus>	 it is about not wiping data by default
[12:15:18] <marostegui>	 yes, that is what I meant
[12:15:22] <jynus>	 if for some reason it got on install
[12:15:33] <jynus>	 a puppet change would be needed to remove /srv
[12:15:53] <marostegui>	 yes yes, I wasn't clear, with /srv I meant its content 
[12:16:18] <jynus>	 I am not sure this will work
[12:16:32] <jynus>	 but easiest way is to try I think
[12:16:48] <jynus>	 and there is nothing to lose on an actual host
[12:16:57] <marostegui>	 yep, we have a host we can play around with  :)
[12:20:56] <jynus>	 wmf-auto-reimage -p T160242 es2015.codfw.wmnet
[12:20:57] <stashbot>	 T160242: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242
[12:21:01] <jynus>	 marostegui^
[12:21:03] <jynus>	 ok?
[12:21:45] <marostegui>	 yes
[12:21:46] <marostegui>	 lokos good
[12:21:50] <marostegui>	 looks
[12:22:28] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2017032...
[12:25:04] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 13Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3121643 (10Marostegui) db1084 and labsdb1010 are done: ``` root@neodymium:~# for i in  labsdb1010.eqiad.wmnet db1084.eqiad....
[12:26:19] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 06Multimedia, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3121660 (10Marostegui) db1084 and labsdb1010 are done: ``` root@neo...
[12:47:44] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure: Explore 'Analyze' statement as substitute for Explain - https://phabricator.wikimedia.org/T141095#3121693 (10jcrespo) https://tools.wmflabs.org/tools-info/optimizer.py no longer works, and that is a problem for users wanting to EXPLAIN their queries.
[13:00:41] <wikibugs_>	 10DBA, 10Monitoring, 06Operations, 10media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#3121712 (10jcrespo)
[13:01:31] <wikibugs_>	 10DBA, 10Monitoring, 06Operations, 10media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2491529 (10jcrespo)
[13:01:54] <wikibugs_>	 10DBA, 10Monitoring, 06Operations, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (10jcrespo)
[13:04:20] <wikibugs_>	 10DBA, 10Monitoring, 06Operations, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (10Marostegui) Good example of a db server where that happens with big alter tables: dbstore2001
[13:17:22] <wikibugs_>	 10DBA, 10Analytics, 10Analytics-EventLogging, 10ImageMetrics, 13Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3121750 (10Marostegui) So far so good!  ``` root@EVENTLOGGING m4[log]> show tables like 'ImageMetr...
[13:25:54] <jynus>	 wmf-reimage failed badly for es2015
[13:26:00] <marostegui>	 :(
[13:26:03] <jynus>	 I had to kill it server
[13:26:04] <marostegui>	 the partitioner?
[13:26:06] <jynus>	 I am doing it manually
[13:26:06] <jynus>	 no
[13:26:11] <jynus>	 the salt call
[13:26:14] <marostegui>	 oh
[13:26:20] <jynus>	 it gets stuck if there is no salt previously
[13:26:33] <jynus>	 db1057 was also being installed still
[13:26:35] <marostegui>	 did you use --new?
[13:26:46] <jynus>	 no, because technically, it was not new :-)
[13:27:03] <jynus>	 I had to kill db1057 install process on puppetmaster
[13:28:15] <marostegui>	 last time I had an issue when it got stuck on the salt thing, it was actually the IPMI not working
[13:28:27] <jynus>	 it was not that
[13:28:27] <marostegui>	 I have checked and es2015 isn't here https://phabricator.wikimedia.org/T150160
[13:28:39] <jynus>	 IPMI worked fine
[13:28:50] <jynus>	 although I had to reboot it manually, too
[13:29:31] <jynus>	 I am going to lunch, did you go already?
[13:29:38] <marostegui>	 i did yep
[13:29:58] <jynus>	 can you keep an eye so that es2015 does not page?
[13:30:06] <marostegui>	 I downtimed it till friday
[13:30:07] <jynus>	 in case icinga decides to do so
[13:30:09] <marostegui>	 yesterday I believe
[13:30:09] <jynus>	 yeah
[13:30:11] <jynus>	 but
[13:30:12] <marostegui>	 ah
[13:30:12] <marostegui>	 yes
[13:30:14] <marostegui>	 the reinstall XD
[13:30:20] <marostegui>	 I am not so sure the IPMI is working fine though
[13:30:22] <marostegui>	 look
[13:30:23] <jynus>	 icinga sometimes decides the same service with the same name
[13:30:40] <marostegui>	 https://phabricator.wikimedia.org/P5105
[13:31:07] <jynus>	 it is not the same service, and takes away the downtime, that is why I asked you to keep an eye on it
[13:31:20] <jynus>	 marostegui, yeah
[13:31:24] <jynus>	 I am not worried about that
[13:31:34] <jynus>	 I was planning on restarting this server 10 times
[13:31:40] <jynus>	 so we can take that later
[13:31:46] <marostegui>	 ok :)
[13:31:51] <marostegui>	 no worries, I will keep an eye on it
[13:32:14] <marostegui>	 is it now stopped? powered off? reimaging or just waiting for you to act on it?
[13:32:34] <jynus>	 it was installed
[13:32:39] <marostegui>	 ah cool!
[13:32:44] <jynus>	 I am checking the install itself was done ok
[13:32:47] <marostegui>	 ok :)
[13:32:51] <marostegui>	 go and have lunch then
[13:32:56] <jynus>	 ok
[13:33:08] <jynus>	 and the next step is to touch a file on /srv
[13:33:19] <jynus>	 and reinstall with the new recipe
[13:33:47] <marostegui>	 ok
[14:14:06] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 13Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3121980 (10Marostegui) db1081 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb1081 commonswiki -e "show create table im...
[14:14:32] <wikibugs_>	 07Blocked-on-schema-change, 10DBA, 06Multimedia, 05MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3121982 (10Marostegui) db1081 is done: ``` root@neodymium:~# mysql...
[14:47:23] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122024 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wm...
[14:49:54] <jynus>	 https://phabricator.wikimedia.org/P5107
[14:50:45] <marostegui>	 yeah
[14:50:48] <marostegui>	 saw that this morning
[14:50:53] <marostegui>	 when executing the command
[14:51:17] <marostegui>	 I guess we'd need to see what they change it for and see if it affects the wmf-reimage script if it uses it
[14:51:28] <jynus>	 it shouldn't
[14:51:44] <jynus>	 but it will affect the "manual" documentation
[14:52:09] <jynus>	 I am connected to the serial console, crossing fingers for the recipe
[14:52:28] * marostegui crossing his fingers and his cat's fingers
[14:53:10] <jynus>	 no errors
[14:53:28] <jynus>	 I didn't see formatting /srv, but that doesn't prove anything
[14:53:39] <jynus>	 (only / ext3)
[14:54:15] <marostegui>	 that is a good sign :)
[15:04:14] <jynus>	 it tried to boot from network after install
[15:04:22] <jynus>	 bios boot order is wrong
[15:07:09] <marostegui>	 yes, it is wrong
[15:07:21] <jynus>	 I am changing it now
[15:07:27] <jynus>	 did it reset for you?
[15:07:38] <jynus>	 or did you just boot once with the boot manager
[15:07:49] <marostegui>	 https://phabricator.wikimedia.org/T160242#3120884
[15:07:56] <marostegui>	 I tried to change it with no luck
[15:07:58] <marostegui>	 it never worked
[15:08:08] <marostegui>	 I believe we need papaul to change it for us
[15:08:17] <marostegui>	 I was chatting to him around 20 minutes ago or so
[15:08:24] <marostegui>	 and asked him to hold until you were done with your tests
[15:09:49] <marostegui>	 I tried changing it on the bios but I couldn't and then I tried with racadm but even though it says "successfully" it wasn't changed
[15:10:25] <jynus>	 strip size is 256K
[15:10:28] <jynus>	 so that is kept
[15:11:22] <marostegui>	 well, the raid was never destroyed
[15:11:24] <jynus>	 "I tried to change it with no luck"
[15:11:46] <jynus>	 so you did the same thing I am doing? putting the disk first and it didnt work?
[15:12:05] <marostegui>	 So what I did was:
[15:12:31] <marostegui>	 1) trying to change the order on the BIOS menu, but I wasn't able to change it - so I wasn't able to change the order of the boot sequence
[15:12:46] <jynus>	 I was able
[15:12:47] <marostegui>	 2) tried racadm, which says: all done, but once i rebooted, it was still trying PXE first
[15:12:51] <marostegui>	 Oh really?
[15:12:51] <jynus>	 it is booting from C now
[15:12:56] <jynus>	 I think I know what it is
[15:13:00] <jynus>	 it is not your fault
[15:13:13] <marostegui>	 which keys did you use? I tried all the normal ones: tab, +,-, space, enter...
[15:13:14] <jynus>	 some BIOS require a proper installation to allow the option
[15:13:20] <jynus>	 it is not that
[15:13:39] <jynus>	 you need an os install, and then bios allows you to select C
[15:13:48] <marostegui>	 oh really? never seen that before, interesting
[15:14:01] <jynus>	 or maybe you didn't press pgup ?
[15:14:06] <jynus>	 one of the two
[15:14:10] <jynus>	 independently
[15:14:17] <jynus>	 say it was the BIOS
[15:14:47] <jynus>	 otherwise you will lose respect as the "hardware guy" :-P
[15:14:52] <marostegui>	 hahahaha
[15:15:00] <marostegui>	 I never said I was the hardware guy! :p
[15:15:05] <jynus>	 I did
[15:15:13] <jynus>	 the thing I told you is true
[15:15:15] <marostegui>	 so, even now isn't booting from C?
[15:15:17] <jynus>	 on some servers
[15:15:24] <jynus>	 it has already booted
[15:15:46] <jynus>	 on some servers the option to boot from C is only available after server install
[15:15:57] <jynus>	 and chris and papaul hate that
[15:16:00] <marostegui>	 interesting, never seen it
[15:16:04] <marostegui>	 I now hate that too :p
[15:19:44] <jynus>	 not sure the installer is doing anything
[15:19:51] <jynus>	 it is stuck in the same place as before
[15:20:09] <jynus>	 despite me adding it to salt this time
[15:20:46] <marostegui>	 last time I added --debug to the installer
[15:20:52] <marostegui>	 and I got some more info about what was going on
[15:20:59] <jynus>	  /srv is not mounted
[15:21:26] <marostegui>	 I guess puppet hasn't run yet, no? As I cannot log in yet
[15:22:39] <jynus>	 the partition with lvm is there
[15:22:51] <jynus>	 but I am afraid it may have been recreated and just not formatted
[15:24:00] <jynus>	 mount: wrong fs type, bad option, bad superblock on /dev/mapper/tank-data,
[15:24:51] <marostegui>	 are you able to see the FS on it? 
[15:24:53] <marostegui>	 on not even?
[15:24:56] <marostegui>	 *or
[15:25:14] <jynus>	 no, I think it is recreated but not formatted
[15:25:23] <jynus>	 which is what I just told the recipe
[15:25:35] <jynus>	 "do not format"
[15:26:19] <marostegui>	 well, if the lvm is there, then we should be able to mount it no?
[15:27:25] <jynus>	 no, it is not the old lvm
[15:28:08] <jynus>	 partitioning happened and destroyed all partitions
[15:28:47] <marostegui>	 right, because /dev/mapper is still /
[15:29:11] <marostegui>	 mm, but: v_name{ data }	
[15:29:57] <jynus>	 https://phabricator.wikimedia.org/P5108
[15:30:40] <jynus>	 2017-03-22 14:53:00 is on the reimage
[15:30:50] <jynus>	 it just has not been formatted
[15:32:11] <marostegui>	 ah, i see
[15:32:22] <marostegui>	 right, so it was indeed created on the reimage
[15:32:26] <marostegui>	 (as per your comment on the hour)
[15:33:04] <jynus>	 # the install makes sure we want to wipe the lvm
[15:35:15] <marostegui>	 yeah, I am checking again the partman recipe
[15:36:27] <jynus>	 https://gerrit.wikimedia.org/r/#/c/344160/
[15:38:23] <marostegui>	 that should be it
[15:38:31] <marostegui>	 let's try
[15:38:34] <jynus>	 :-/
[15:38:45] <jynus>	 I am trying to format the partition
[15:40:15] <jynus>	 I've formatted it as ext4
[15:40:24] <jynus>	 I am not going to install xfs just to reimage again
[15:40:31] <marostegui>	 hehe yeah
[15:40:34] <marostegui>	 did it work fine?
[15:58:44] <jynus>	 it enter an infinite loop- which may not be that bad :-)
[15:58:49] <jynus>	 *enters
[15:59:02] <marostegui>	 of getting restarted all the time?
[15:59:04] <marostegui>	 reimaged
[15:59:14] <jynus>	 no, no root partition defined
[15:59:19] <marostegui>	 ah :(
[15:59:28] <jynus>	 but I cannot go and override it manually
[15:59:43] <jynus>	 technically, this solves the issue
[16:00:09] <marostegui>	 but it doesn't get reimaged, no?
[16:00:58] <jynus>	 -rw-r--r--    1 root     root             0 Mar 22 15:40 did_pacman_delete_this_file_?
[16:01:03] <jynus>	 nope
[16:01:10] <jynus>	 it is a horrible thing
[16:01:19] <jynus>	 but technically, it works
[16:01:41] <marostegui>	 haha nice name
[16:02:50] <jynus>	 none of the partitions are imaged
[16:02:50] <marostegui>	 I wonder if the boot order thingy is the thing affecting the reimage
[16:02:56] <marostegui>	 oh
[16:03:17] <jynus>	 so I would continue trying to get the recipe fixed
[16:03:20] <jynus>	 don't get me wrong
[16:03:30] <jynus>	 but I think like this is better than before
[16:03:44] <marostegui>	 yeah, at least it doesn't wipe the data
[16:03:45] <jynus>	 so when you want to reimage, you have to add db.cfg manually
[16:03:54] <jynus>	 as I did on the previous patch
[16:03:57] <jynus>	 I know it is horrible
[16:03:58] <marostegui>	 yeah
[16:04:01] <marostegui>	 Well, it is safe
[16:04:11] <jynus>	 but this will get eventually fixed
[16:04:22] <jynus>	 and if a large reimage happens, only 1 commit is needed
[16:04:34] <jynus>	 happens to be needed
[16:05:00] <jynus>	 I will setup a ticket to create a proper recipe
[16:05:12] <marostegui>	 sounds good
[16:05:22] <jynus>	 but not with high priority
[16:05:23] <marostegui>	 however, we still need papaul to change the order or not needed anymore?
[16:05:31] <jynus>	 I think I fixed that
[16:05:40] <marostegui>	 so we can attempt a normal reimage?
[16:05:43] <marostegui>	 (changing the db.cfg)
[16:05:44] <jynus>	 the ipmi still doesn't work
[16:05:51] <jynus>	 no need really
[16:05:57] <jynus>	 it works now well
[16:06:04] <jynus>	 it just needs to run puppet
[16:06:05] <marostegui>	 so, if you reboot it manually, it boots from disk again?
[16:06:11] <jynus>	 yes
[16:06:14] <jynus>	 but let me try it
[16:06:17] <jynus>	 once more
[16:06:27] <jynus>	 what is broken is the installer on this host
[16:06:31] <marostegui>	 to be honest, I am glad we "found" this yesterday and not sometime while codfw was active
[16:06:41] <jynus>	 so ipmi
[16:06:46] <jynus>	 and the script+salt
[16:06:49] <marostegui>	 we can try papaul to reseat it for us
[16:07:24] <jynus>	 there is an option on the bios, too
[16:07:29] <jynus>	 we can ask chris
[16:08:29] <marostegui>	 I am talking to papaul now, I can ask him to see if he can reset it for us
[16:08:32] <marostegui>	 he is onsite
[16:08:40] <jynus>	 it boots just right to disk
[16:08:46] <jynus>	 let me put it down
[16:08:48] <marostegui>	 ok
[16:10:42] <jynus>	 I am going to leave the server on the bios menu
[16:10:54] <marostegui>	 ok
[16:11:21] <marostegui>	 papaul will reset it for us in a bit
[16:11:24] <jynus>	 I think chris mentioned a BIOS option that enabled or disabled it
[16:11:29] <jynus>	 reset what?
[16:12:13] <marostegui>	 sometimes idrac has issues and you need to leave the server without ANY power for a bit
[16:12:28] <jynus>	 ok
[16:12:47] <marostegui>	 other than that, maybe we can start recloning it from es2014 tomorrow?
[16:13:21] <marostegui>	 can you logoff if you are?
[16:13:24] <marostegui>	 (from the idrac)
[16:13:41] <jynus>	 I have
[16:14:42] <marostegui>	 thanks
[16:16:44] <marostegui>	 looks like IPMI was disabled
[16:17:52] <marostegui>	 works now
[16:17:53] <marostegui>	 root@neodymium:~# ipmitool -I lanplus -H es2015.mgmt.codfw.wmnet -U root -E chassis power status
[16:17:57] <marostegui>	 Unable to read password from environment
[16:17:59] <marostegui>	 Password:
[16:18:02] <marostegui>	 Chassis Power is on
[16:23:19] <marostegui>	 I am running puppet on the server
[16:35:01] <jynus>	 dbstore1001 is you? aka alter
[16:35:51] <jynus>	 ah
[16:35:52] <jynus>	 no
[16:35:55] <jynus>	 it is the dumps
[16:35:58] <marostegui>	 :)
[16:36:04] <jynus>	 which I think they are starting now
[16:37:56] <marostegui>	 so es2015 looks set up now
[16:38:10] <jynus>	 puppet run and all?
[16:38:14] <marostegui>	 yep
[16:38:17] <marostegui>	 salt and all that
[16:38:50] <marostegui>	 maybe we can depool es2014 and start the transfer
[16:38:57] <jynus>	 wait
[16:39:19] <jynus>	 oh, you did reformat /srv already
[16:39:22] <marostegui>	 yeah
[16:39:25] <marostegui>	 I assumed you were done
[16:39:38] <jynus>	 yes
[16:39:45] <jynus>	 I just wanted to put it back
[16:39:59] <jynus>	 my motto is that if I broke it I fix it
[16:40:03] <marostegui>	 ah
[16:40:05] <jynus>	 and I broke /srv
[16:40:12] <jynus>	 wanted to put it back
[16:40:12] <marostegui>	 haha I broke the whole server!
[16:40:17] <marostegui>	 by rebooting it
[16:40:18] <jynus>	 you didn't
[16:40:38] <jynus>	 although I told you to stop working late :-)
[16:40:42] <jynus>	 I remember that
[16:40:48] <marostegui>	 that is true!
[16:40:52] <marostegui>	 i actually remembered that
[16:40:54] <jynus>	 so probably you should do that right now
[16:41:07] <jynus>	 I've just woken up
[16:41:11] <marostegui>	 haha
[16:41:16] <marostegui>	 are you in pacific zone again?
[16:42:50] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122315 (10Marostegui) p:05High>03Normal a:05Papaul>03None
[16:44:12] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (10Marostegui) The server is now set up, and ready to get the data from es2014. Things that have been done:  - Tested a new way to prevent a server to avoid wiping the part...
[16:44:57] <marostegui>	 jynus: I think I will take your advice now and logoff, if you are fine with it, maybe you can start the transfer from es2014 to it
[16:45:06] <jynus>	 I was going to
[16:45:08] <marostegui>	 if not, I can do that tomorrow morning
[16:45:10] <marostegui>	 Ah :)
[16:45:24] <marostegui>	 great!
[16:45:41] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122330 (10Marostegui) a:03jcrespo
[16:45:57] <marostegui>	 I am going to do groceries then
[16:46:04] <marostegui>	 thanks for all the help
[16:46:43] <marostegui>	 db1068 and labsdb1011 still running the alter table fyi
[17:03:36] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122398 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['es2015.codfw.wmnet'] ```  Of which those **FAILED**: ``` set(['es2015.codfw.wmnet']) ```
[17:06:00] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122421 (10jcrespo) I am going to use the codfw master **es2016** not es2014, because the latter does have compressed tables- something we have yet to fix, and not something we wan...
[17:17:10] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122439 (10jcrespo) I've started the transfer from es2016 to es2015, the transfer may take 11-12 hours, so it will finish by ~6-7 UTC. es2016 and es2015 will be down during it. I h...
[17:53:52] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122497 (10MusikAnimal) @jcrespo Finally got around to this, below are my results. This test query was run on my local Vagrant...
[17:56:13] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122503 (10jcrespo) You have to run the query (not the explain).  ``` FLUSH STATUS; SELECT ...; SHOW STATUS like 'Hand%'; ```...
[17:59:38] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122519 (10MusikAnimal) >>! In T156318#3122503, @jcrespo wrote: > You have to run the query (not the explain).  > ``` > FLUSH...
[18:02:09] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122541 (10jcrespo) Then it is the rows- running that on 4/9 rows is not useful (there are 1000 million revision rows only on...
[18:10:10] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122655 (10jcrespo) https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=es2015&var-network=eth0&from=1490198400000&to=now
[18:21:11] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122730 (10MusikAnimal) >>! In T156318#3122541, @jcrespo wrote: > Then it is the rows- running that on 4/9 rows is not useful...
[18:27:40] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122765 (10jcrespo) Let me create a test table with you somewhere- but please give me a one-liner to set it up.- E.g.  ``` CRE...
[18:36:05] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122776 (10MusikAnimal) >>! In T156318#3122765, @jcrespo wrote: > Let me create a test table with you somewhere- but please gi...
[18:50:01] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122811 (10jcrespo) @MusikAnimal that is very difficult. I need to disable the binary logs and use SUPER to write to a read-on...
[18:52:06] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122817 (10jcrespo) For example: ``` root@db2048[enwiki]> SELECT HEX(INET_ATON('192.0.0.1')); +-----------------------------+...
[19:06:15] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122881 (10MusikAnimal) @jcrespo Nice! It looks like `SELECT HEX(INET6_ATON('192.168.0.1'));` does exactly what we want, but i...
[19:16:43] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3122899 (10jcrespo) Thanks, I may be going soon, but I will try to execute that and see what we get- and have it ready for tom...
[19:44:20] <wikibugs_>	 10DBA, 06Community-Tech, 10MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3123004 (10jcrespo) It took less time than I thought- you can check the table now.
[19:55:17] <wikibugs_>	 10DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3123050 (10jcrespo) 05Open>03Resolved a:03jcrespo My alter seemed to fix the issue. {F6865546}  However, that table is no longer compressed. Probably it is due to the rebuild, not the compression,...
[19:59:27] <wikibugs_>	 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: db1094 crash - https://phabricator.wikimedia.org/T160832#3123061 (10jcrespo) 05Open>03Resolved a:03jcrespo Resolved- we have to contact the vendor if it happens any other time.
[20:03:51] <wikibugs_>	 10DBA, 06Operations, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3123076 (10jcrespo) p:05Triage>03Low
[20:05:49] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3123104 (10jcrespo)
[20:05:52] <wikibugs_>	 10DBA, 06Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3123103 (10jcrespo) 05Open>03Resolved
[21:18:36] <wikibugs_>	 10DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (10thcipriani)
[21:35:36] <wikibugs_>	 10DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (10hashar) [[ https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=codfw%20prometheus%2Fops&var-server=es2016&from=now-12h&to=now 12 hours view of prometh...
[21:39:04] <wikibugs_>	 10DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123361 (10chasemp) seems that way, I didn't see that sal and texted @Marostegui to ask (sorry buddy!)
[21:49:54] <wikibugs_>	 10DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123297 (10bd808) These all seem to be requests for enwiki [[Main Page]] on codfw app and api servers which end up trying to fetch revision content from es2016.codfw...
[21:53:30] <wikibugs_>	 10DBA: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41) - https://phabricator.wikimedia.org/T161159#3123388 (10bd808) {F6869248} Requests are coming from einsteinium.wikimedia.org and tegmen.wikimedia.org and appear likely to be icinga checks. h/t @EBernhardson
[22:31:38] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123543 (10yuvipanda) a:05yuvipanda>03None
[22:31:53] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123552 (10yuvipanda) This was done, and @madhuvishy just made it work...
[22:32:04] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123554 (10yuvipanda) a:03madhuvishy