[06:43:44] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3722358 (Marostegui)
[06:46:46] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3722362 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2089.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017103...
[06:50:44] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3722366 (Marostegui)
[07:05:56] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3722374 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2089.codfw.wmnet'] ``` and were **ALL** successful.
[08:10:30] DBA, Patch-For-Review: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088#3722417 (Marostegui)
[08:26:41] marostegui: I am ready for db1108 whenever you are :)
[08:26:47] (even later on in the morning)
[08:27:06] o/
[08:27:18] let's stop mysql then on db1108
[08:27:42] stopping
[08:27:51] super
[08:28:18] super nice that now we can mask things like eventlogging_sync
[08:28:29] (just checked, it is masked)
[08:29:15] mysql stopped
[08:29:17] let's run puppet?
[08:29:23] sure
[08:29:28] ok
[08:29:30] doing it
[08:31:22] elukey: https://phabricator.wikimedia.org/P6233
[08:31:23] check at the end
[08:32:07] ah yes worked as designed :D
[08:32:10] cool!
[08:32:18] so let's start mysql again
[08:32:43] mysql is up
[08:33:03] let me see if i can see the data
[08:33:28] all looks good from my side
[08:36:09] looks good indeed, trying to do a select count(*) on MediaViewer_10867062_15423246 to see how fast it is :D
[08:36:51] hahaha
[08:37:00] it has the buffer pool cold still!
[08:37:23] going to add db1108 to tendril
[08:40:56] 1 row in set (4 min 17.00 sec)
[08:40:59] impressive
[08:41:07] \o/
[08:42:08] 382 tables like on db1047, looks goood
[08:42:22] /dev/mapper/tank-data 3.6T 1.2T 2.5T 33% /srv
[08:42:29] that is good as well
[08:42:29] the eventlogging script will need some time i guess to re-sync data
[08:42:40] oh yes but it will do it in small batches (1000)
[08:42:43] nice
[08:42:49] re-enabling it
[08:43:06] i think this time it took longer because we tried some things, but now we know how to do it
[08:43:09] without splitting tables
[08:43:23] and the import should be finished in 48h or so
[08:43:33] (for the next server i mean)
[08:43:44] super
[08:44:04] the script is running fine! I'll keep it monitored and inform my team about the new db
[08:44:10] sounds good!!
[08:44:24] let's keep db1047 for a week or something to make sure db1108 works as expected I would say, no?
[08:44:27] the next step will be to move the analytics-slave CNAME to db1108 when ready, and then ping all the people having data on db1047 to migrate things
[08:44:31] yes yes
[08:44:43] before deprecating it I'll make a very careful review :)
[08:44:50] the idea is to
[08:44:59] yeah, i would actually move the cname maybe next monday? so people can start using it?
[08:45:25] even earlier on, the majority of usage comes from our report updater jobs afaik
[08:45:43] ah cool :-)
[08:45:48] next week I'll announce to people that the log database will only be available for queries on db1108
[08:46:00] and also set a deadline to drop the log db on dbstore1002
[08:46:27] sounds good to me!
[08:46:36] thanks a lot for all the work!!
[08:46:39] <3
[08:46:43] my pleasure!!
[08:47:50] we have to remember to enable notifications on db1108 whenever we are ready
[08:48:11] yep you are definitely right, will do it on the 2nd
[08:48:55] cool!
[08:52:47] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3722449 (Marostegui) db1047's data has been migrated to db1108. It is working fine now. We are going to leave db1108 working for a few days, make sure the event logging syn...
[08:53:07] marostegui: qq about refreshlinks - is there a way to see what was changed by one of the last refreshlink jobs on say the pagelink table?
[08:53:31] or what kind of update queries are hitting the dbs
[08:53:43] (update/delete/etc..)
[08:53:47] let me see the binlogs
[08:54:02] because it might be a good indicator that a template has been changed or similar
[08:54:07] I can't find this info anywhere
[08:56:22] i am checking binlogs on enwiki
[08:56:29] as I assume there will be jobs for enwiki too?
[08:56:41] the major issue seems to be commons
[08:56:46] ah ok
[08:56:49] let me move to commons
[08:56:51] sorry didn't mention :(
[08:58:07] lets seeee
[08:58:54] atm the jobrunners are processing roojobs with timestamp 20171028115540
[08:59:04] but I can see that they keep being inserted even now
[09:00:17] check what i sent you
[09:03:35] marostegui: morning, let me know when you want to merge that change and how I can help ;)
[09:03:46] let's go for it?
[09:07:44] sure! what do you want me to keep an eye on?
[09:08:35] basically logstash i would
[09:08:38] i would say
[09:08:40] that we have to look for
[09:08:55] just for errors
[09:09:01] as nothing will hit it really
[09:09:21] going to +2 then
[09:10:24] which is the right kibana dashboard those days?
[09:10:48] i am going to use fatals i think
[09:10:53] apart from DBQuery
[09:10:56] And mediawiki
[09:11:34] ofc
[09:11:50] deploying...
[09:11:53] rip XD
[09:12:29] deployed
[09:13:04] i can browse enwikipedia, commons and dewiki
[09:13:43] cache miss?
[09:14:22] I can with the mwdebug extension
[09:14:43] server:mw2099.codfw.wmnet
[09:15:50] i am not seeing anything alarming on fatals
[09:16:05] agree
[09:17:31] marostegui: why https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php is not up to date?
[09:17:44] it is for me
[09:17:46] cache?
[09:18:37] probably, it is in prod
[09:18:50] https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php&hello
[09:18:54] can you see it now?
[09:19:18] yep
[09:20:00] i think nothing broke
[09:20:07] shhhhh
[09:20:10] X-DDDD
[09:20:15] last famous words is always around the corner ;)
[09:20:56] the sooner we say them, the sooner we can revert and not during lunch :p
[09:21:33] lol
[09:21:51] do you know what are all those unrelated "Error reading image metadata: Failed to read image data" errors from thumbor?
[09:22:31] nope
[09:55:11] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3722507 (Marostegui) This patch has been merged and deployed: https://gerrit.wikimedia.org/r/#/c/386810/ we now have db2084 as the first host "serving" (not really because it is codfw) with multi-inst...
[10:00:31] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3722513 (Marostegui)
[10:01:37] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3695660 (Marostegui) db2089 is done. It is replicating s6 and s5. It will replicate s8 in the future (as specified in the original task description) but as s8 will be a split of wikidata from s5, it is...
[11:19:16] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3722684 (Marostegui)
[14:12:09] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3723245 (Marostegui)
[14:14:28] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3723247 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2085.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017103...
[14:33:41] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3723314 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2085.codfw.wmnet'] ``` and were **ALL** successful.