[06:57:53] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3097126 (Marostegui) db1083: ``` root@PRODUCTION s1[enwiki]> show create table revision\G *************************** 1. row *************************** Table:...
[07:11:10] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2360296 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1070.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201703140710_maroste...
[07:18:08] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3097166 (Marostegui) dbstore1001 is done: ``` root@neodymium:~# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdbstore1001 $i -e "show c...
[07:20:13] Blocked-on-schema-change, DBA, MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), MW-1.28-release-notes, Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#3097169 (Marostegui)
[07:20:17] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3097170 (Marostegui)
[07:20:20] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3097167 (Marostegui) Open>Resolved I started the ALTER last night on labsdb1003 to have it at least on one of the labs servers. It is already...
[07:24:36] Blocked-on-schema-change, DBA, MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)), MW-1.28-release-notes, Patch-For-Review: Clean up revision UNIQUE indexes - https://phabricator.wikimedia.org/T142725#2544418 (Marostegui) An update on what's going on on the work we have been doing lately...
[07:44:55] DBA: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390#3097179 (Marostegui)
[07:50:24] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097194 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1070.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201703140750_maroste...
[08:08:56] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097211 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1070.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201703140808_maroste...
[08:12:39] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097213 (Marostegui) After some troubleshooting as the server wasn't getting reimaged I found that db1070 is suffering this: ``` Error: Unable to establish IPMI v2 / RMCP+ session ``` And it is indeed listed here: T...
[08:19:57] DBA, Operations, Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#3097232 (Marostegui)
[08:19:59] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097230 (Marostegui)
[08:29:06] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3097248 (Marostegui)
[09:05:14] DBA: s5: db1070 not using file per table - https://phabricator.wikimedia.org/T157931#3097283 (Marostegui)
[09:05:16] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097282 (Marostegui)
[09:07:10] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2360296 (Marostegui) db1070 has been manually reimaged and it is now getting the mysqldump from yesterday imported back: ``` root@db1070:/srv/sqldata/dewiki# lsb_release -a No LSB modules are available. Distributor ID:...
[09:25:24] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3097293 (Marostegui) And labsdb1003 finished: ``` [root@labsdb1003 09:24 /root] # for i in frwiki jawiki ruwiki; do mysql --skip-ssl $i -e "show creat...
[10:23:11] DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638#3097450 (Marostegui) in db2057 (s3) I have renamed all the echo tables to: ``` T153638_echo_xxxx ``` Just to make sure replication doesn't break, I will leave it like that for a few days before dropping them (after t...
[10:34:32] oh, I see
[10:45:15] DBA, Analytics, Labs: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#3097467 (JAllemandou) I discussed this with @ArielGlenn. He told me he would investigate. Ping @ArielGlenn?
[11:45:36] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3097543 (Marostegui) After having a chat with Jaime, he correctly pointed out that we should try to load in parallel to avoid a week of importing for this host. So what I have done is: - split the file into databases:...
[12:21:00] I intend to reimage db1057 later
[12:21:23] sounds good
[13:13:27] Blocked-on-schema-change, DBA: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097698 (Nikerabbit)
[13:13:43] Blocked-on-schema-change, DBA: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097710 (Nikerabbit)
[13:17:44] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097738 (KartikMistry)
[13:41:35] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097808 (jcrespo) a:jcrespo
[13:43:53] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097698 (Marostegui) This needs to be run on x1. The table isn't too big, so probably can be run on the master...
[13:45:29] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097824 (jcrespo) a:jcrespo>None You can take it if you want. This would be a great opportunity to tes...
[13:46:33] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097828 (Marostegui) >>! In T160407#3097824, @jcrespo wrote: > You can take it if you want. This would be a gr...
[13:55:49] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097840 (Marostegui) So, tomorrow morning I will run the following on all the x1 slaves: ``` stop slave; set g...
[13:57:03] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097842 (jcrespo) The plan would be running: ``` ./software/dbtools/osc_host.sh --host=db1031.codfw.wmnet --d...
[13:58:41] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097843 (Marostegui) >>! In T160407#3097842, @jcrespo wrote: > The plan would be running: > > ``` > ./softwar...
[14:03:44] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3097846 (Marostegui) db1080 is done but I am wondering if we should continue altering the pending slaves or wait for this to be fixed: T159319#3097521 ``` root@neody...
[14:06:23] marostegui: jynus mornin :) just a question whenever you get a sec. trying to take a look at https://phabricator.wikimedia.org/T154355.
Whenever I try to replace the view there for the page table for enwiki the connection hangs
[14:06:28] I can't seem to figure out what it's doing
[14:06:40] 2017-03-14 14:02:17,012 INFO Full views for enwiki:
[14:06:41] 2017-03-14 14:02:17,014 INFO [page]
[14:06:41] 2017-03-14 14:02:17,015 DEBUG SQL:
[14:06:42] CREATE OR REPLACE
[14:06:42] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097851 (jcrespo) @marostegui Please give it a second look: ``` $ ./software/dbtools/osc_host.sh --host=db103...
[14:06:44] DEFINER=viewmaster
[14:06:46] VIEW `enwiki_p`.`page`
[14:06:48] AS SELECT * FROM `enwiki`.`page`;
[14:06:57] hanging now in labsdb1001
[14:07:08] labsdb1001 doesn't have the schema change
[14:07:17] I mentioned that several times on that ticket
[14:07:47] it is impossible to have it done without killing all connections for days
[14:08:05] https://phabricator.wikimedia.org/T154355#2919601
[14:08:32] ok thanks, sorry I missed that, I'm not sure why that would result in this behavior necessarily
[14:08:38] is labsdb1003 the same?
[14:08:43] chasemp, probably
[14:08:58] the same reason why I cannot add it in the first place
[14:09:01] metadata locking
[14:09:12] there are continuously long running queries
[14:09:29] so it gets locked and it also blocks all selects after it
[14:09:38] so kill your trials as soon as possible
[14:09:55] killed, and I'll rerun w/ a hack to skip enwiki
[14:09:57] please modify your code to add "SET SESSION lock_wait_timeout=10"
[14:10:10] that will make your queries fail after 10 seconds
[14:10:15] instead of infinity
[14:10:15] gotcha
[14:10:29] the difference with the new setup:
[14:10:37] we can depool and repool at will
[14:10:41] no more of this issue
[14:11:13] we decided not to spend weeks on this issue for the old servers
[14:11:14] in fact
[14:11:24] you can create that view with a constant value
[14:11:43] and you would get perfect results (that column will always be null for enwiki)
[14:11:52] which is why it is a stupid change
[14:12:05] stupid == not worth our time
[14:12:28] sorry, I forgot to answer your question, chasemp
[14:12:35] it is on labsdb1003 and on the new servers
[14:13:12] why is labsdb1003 not the same as labsdb1001?
[14:13:27] because labsdb1003 is not used for enwiki
[14:13:30] (keep in mind I don't care necessarily)
[14:13:32] ah right
[14:13:33] there is high contention on enwiki
[14:17:43] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097891 (Marostegui) Some observations: - this case we'd need to remove: `--no-replicate` as we do want it to...
[14:20:28] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 3 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097899 (jcrespo) ``` $ ./software/dbtools/osc_host.sh --host=db1031.eqiad.wmnet --db=wikishared --table=cx_tr...
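The advice given at 14:09:57 (set a session lock wait timeout before replacing a view, so that a metadata lock makes the statement fail instead of hanging) can be sketched roughly as below. This is a minimal illustration and not the actual view-maintenance script: the function name `view_statements`, its parameters, and the identifier check are assumptions made for the example; only the `SET SESSION lock_wait_timeout=10` statement and the shape of the view DDL come from the log.

```python
import re

# Conservative whitelist for identifiers (an assumption of this sketch,
# not a rule taken from the log).
_NAME = re.compile(r"[A-Za-z0-9_]+")


def view_statements(db, table, definer="viewmaster", timeout_s=10):
    """Return the SQL to (re)create one full view with a bounded lock wait.

    Hypothetical helper: it only builds strings; executing them against a
    server is left to the caller.
    """
    for name in (db, table, definer):
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if int(timeout_s) <= 0:
        raise ValueError("timeout_s must be a positive number of seconds")
    return [
        # Give up after timeout_s seconds instead of waiting on a metadata lock.
        f"SET SESSION lock_wait_timeout={int(timeout_s)}",
        f"CREATE OR REPLACE DEFINER={definer} "
        f"VIEW `{db}_p`.`{table}` AS SELECT * FROM `{db}`.`{table}`",
    ]


if __name__ == "__main__":
    for statement in view_statements("enwiki", "page"):
        print(statement + ";")
```

Both statements would have to be sent over the same connection, since the timeout is set per session.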
[14:22:10] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 3 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097904 (Marostegui) >>! In T160407#3097899, @jcrespo wrote: > ``` > $ ./software/dbtools/osc_host.sh --host=d...
[14:22:14] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 3 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097905 (jcrespo) a:Marostegui How much do we trust slave_parallel_threads, BTW? I would leave this to you...
[14:26:06] Blocked-on-schema-change, DBA, ContentTranslation, ContentTranslation-Deployments, and 3 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#3097934 (Marostegui) >>! In T160407#3097905, @jcrespo wrote: > How much do we trust slave_parallel_threads, BT...
[14:38:00] jynus: there are 2 that seem not to return in the 10s timeframe on labsdb1003: 'wikidatawiki', 'dewiki'
[14:38:11] yeah
[14:38:22] welcome to my world of metadata locking
[14:38:31] :) I'm going to note them as excluded on this setup
[14:38:34] for your case you could make it work
[14:38:36] and move on, cool w/ you?
[14:38:46] just kill the long running connections using them
[14:39:14] they should not block the tables/views in the first place
[14:39:59] or just do that, you are the owner of the labs views!
[14:40:11] I am happy with any decision you take there
[14:40:26] you only need to ask if you have to change the actual tables/data
[14:40:50] is there a sane way to narrow down what queries are using those views and kill them?
[14:41:02] pfff
[14:41:13] SHOW PROCESSLIST | grep wikidatawiki
[14:41:28] SHOW FULL PROCESSLIST | grep wikidatawiki | grep page
[14:41:30] maybe ?
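The narrowing asked about at 14:40:50 can be sketched in Python over tab-separated processlist output, for example what the `mysql` client prints in batch mode for `SHOW FULL PROCESSLIST`. This is an illustration under stated assumptions: the function name `blocking_candidates` and its parameters are hypothetical, and the column order (Id, User, Host, db, Command, Time, State, Info) is assumed to be the standard one. It only returns thread ids; deciding whether to `KILL` any of them stays with the operator.

```python
def blocking_candidates(processlist_text, db, needle, min_seconds=0):
    """Return thread ids of queries on `db` whose text mentions `needle`.

    Hypothetical helper: expects tab-separated rows in the assumed column
    order Id, User, Host, db, Command, Time, State, Info (extra trailing
    columns are ignored).
    """
    ids = []
    for line in processlist_text.splitlines():
        fields = line.split("\t")
        # Skip the header row and anything that is not a full data row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        thread_id, _user, _host, row_db, _command, time_s, _state, info = fields[:8]
        if row_db != db:
            continue
        if not time_s.isdigit() or int(time_s) < min_seconds:
            continue
        if needle.lower() in info.lower():
            ids.append(int(thread_id))
    return ids


if __name__ == "__main__":
    import sys

    # Example: mysql -e "SHOW FULL PROCESSLIST" | python3 this_script.py
    for thread_id in blocking_candidates(sys.stdin.read(), "wikidatawiki_p", "page", 60):
        print(thread_id)
```

Matching on the query text is a coarse filter (a needle such as `page` also matches `pagelinks`), so the returned ids are candidates to inspect rather than a kill list.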
[14:41:44] ok right :) I didn't think that would work how I imagine it for some reason
[14:41:53] well, it doesn't
[14:42:12] you can do "pager grep wikidatawiki | grep page" on the command line
[14:42:31] or mysql -e "SHOW FULL PROCESSLIST" | grep bla bla
[14:43:20] 81713521 s51999 10.68.20.251:46750 wikidatawiki_p Query 101 Queried about 154850000 rows select DISTINCT epp_entity_id from wb_entity_per_page,pagelinks WHERE epp_entity_id IN (636356,23020716) AND epp_page_id=pl_from AND pl_namespace=0 AND pl_title IN ('Q4167410','Q4167836','Q11266439','Q13406463','Q17362920','Q17633526') 0.00
[14:43:27] ha
[14:49:33] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3098013 (chasemp) a:chasemp>TTO As far as I know this is all deployed as intended now, please validate.
[16:18:45] db1057 rebooted, but I have no idea what it is doing (and I am connected to a serial console)
[16:21:55] but it got installed fine?
[16:22:01] or not even?
[16:22:02] no idea
[16:22:16] it says it is on
[16:22:22] but as if it didn't matter
[16:22:47] what do the wmf-reimage logs say?
[16:23:02] it is stuck
[16:23:44] I think it didn't come back after reboot
[16:24:07] I am going to force a powercycle
[16:24:31] at least you'll be able to see how it power ups
[16:24:36] powers up
[16:24:45] IF
[16:24:54] i didn't want to say it :(
[16:25:41] I powercycled, but no response on the output yet- I will wait, then I will do a hard reset
[16:26:08] just in case, remember we have some uncommented servers in s5 or s6 I believe
[16:26:11] *commented
[16:26:24] what do you mean?
[16:26:48] you mean like spares?
[16:26:51] yeah
[16:27:03] this was technically a spare in the first place
[16:27:06] not too worried
[16:27:22] and I have the data, which was more valuable
[16:28:17] yes yes, just saying just in case you didn't want to have s1 without the old master for a long time (or for as long as it takes to fix db1057)
[16:28:20] still not booting up?
[16:28:36] nope, hard resetting
[16:28:38] and logging
[16:29:28] fingers crossed
[16:32:20] still not working
[16:32:49] I am going to try to power it down
[16:33:02] does the power cycle actually work?
[16:33:09] yeah, try the power down
[16:33:12] and let's see if it powers down
[16:33:24] yeah, that's the plan
[16:33:28] if it is unresponsive
[16:33:35] maybe the power source is bricked
[16:33:42] *power supply
[16:34:43] Server power status: OFF
[16:35:01] Server power status: ON
[16:35:04] at least that works
[16:35:24] yeah, it is responsive
[16:35:35] but I am suspecting the server itself is not
[16:35:57] power supply or something that makes not even the bios work
[16:35:58] maybe we can ask chris tomorrow to plug a crash cart and see what he sees on the screen
[16:36:38] it could be also the serial port
[16:36:47] but the installation would have worked
[16:36:51] exactly
[16:36:56] and we would eventually be able to ping it
[16:37:02] yes
[16:37:06] even during install
[16:37:24] I think it went down, never got up again
[16:37:38] yeah
[16:37:48] i would leave it off
[16:37:50] so it can cool down
[16:37:57] maybe there is something on the logs
[16:37:57] and check with chris tomorrow
[16:38:09] my bet is that it is not even up
[16:38:17] but nothing against it
[16:38:25] let me first search for logs
[16:38:46] sure
[16:41:53] "Drive 10 is installed in disk drive bay 1."
[16:42:07] uh
[16:42:12] that is normal
[16:42:22] power supply readings normal
[16:44:50] 6 normal resets, I think I only did 2
[16:45:05] plus the one from the reimage?
[16:45:21] and maybe if it got installed fine, another one
[16:45:30] maybe
[16:51:01] going to log off for now, I will see you tomorrow!
[16:53:21] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3098547 (Marostegui) db1070 keeps importing stuff, in parallel (main and biggest tables first) so far: ``` root@db1070:/srv/sqldata# du -sh dewiki/ wikidatawiki/ 248G dewiki/ 78G wikidatawiki/ ```
[16:58:20] better finding out like this than in the middle of an important operation
[18:56:38] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099166 (Nuria)
[18:58:09] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099215 (Nuria) p:Triage>High
[18:58:33] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099219 (jcrespo) Why not deploy a new schema instead? Are the already inserted user agents going to change?
[18:59:10] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099221 (Nuria)
[18:59:47] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099166 (Nuria)
[19:09:43] DBA, Labs, wikitech.wikimedia.org: SemanticMediaWiki tries to create temporary tables, but can't as wikiuser is restricted - https://phabricator.wikimedia.org/T110981#3099283 (Bawolff) Open>declined > If this was me, I would close it as won't fix You are the DBA, if the status of this ticket...
[19:12:22] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099300 (Nuria) >Are the already inserted user agents going to change? no >Why not deploy a new schema instead?
A new schema would not change length of column as that is hardcoded on db so...
[19:16:12] DBA, Patch-For-Review: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3099319 (Marostegui) svwiki has no differences thwiki - differences on db1047 and dbstore1002 trwiki - in progress now
[19:26:45] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099400 (jcrespo) > other than alter would need to run in a smaller set of tables, newly created. For me that would be a huge win- it could be deployed in seconds, rather than weeks. > Cha...
[19:38:51] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3099433 (jcrespo) To be clear: there are 3 machines with eventlogging/analytics stuff (among other)- db1046, db1047 and dbstore1002 (this last one should still be ok for a c...
[19:55:43] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099462 (Nuria) >other than alter would need to run in a smaller set of tables, newly created. Sorry, I did not explain this well-enough. A capsule change would not help little in this case...
[20:10:47] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3099522 (Ottomata) Oh yeah, rats, I totally forgot to put this in our budget request. Hm. do db1046 and db1047 host just EL data, or also wiki dbs?
[20:19:32] DBA, Flow, Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), MW-1.27-release (WMF-deploy-2015-12-08_(1.27.0-wmf.8)), and 2 others: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509#3099573 (Mattflaschen-WMF) a:matthiasmullie>None
[20:24:50] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3099624 (TTO) `metawiki_p.page` now contains the page_lang column; however, `user_groups` view still does not contain the `ug_expiry` column. Shall I open a new task for that?
[21:28:59] DBA, Analytics, Analytics-EventLogging, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3099979 (TerraCodes)
[22:39:51] DBA, Analytics-Kanban: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3099166 (Nuria) p:High>Triage
[23:07:29] DBA, CheckUser, Community-Tech: Investigation: Add old and new length columns to cu_changes - https://phabricator.wikimedia.org/T155734#3100308 (DannyH)
[23:07:51] DBA, Community-Tech, MediaWiki-User-blocking: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3100311 (DannyH)