[06:32:31] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3070032 (Marostegui) db2016 (codfw master) is done: ``` root@db2016.codfw.wmnet[enwiki]> show create table revision\G *************************** 1. row *************...
[06:43:20] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3069402 (Marostegui) Hi, Indeed, the new labs servers (running ROW based replication are fine) and that drift is probably coming from mult...
[07:37:19] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3070071 (bd808) > Since you're continuously curious about Labs frustrations, the replicas are definitely top-five. Noted. We are really hop...
[07:53:50] greetings, FYI kart was asking about https://phabricator.wikimedia.org/T146450 as part of SoS
[08:01:31] godog, are you sure it was for someone on ops, or in general?
[08:05:35] jynus: it was for you (or manuel) specifically
[08:06:34] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3070099 (Marostegui) >>! In T159493#3070071, @bd808 wrote: >> Since you're continuously curious about Labs frustrations, the replicas are de...
[08:37:41] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3070129 (jcrespo) @MZMcBride , why wait when you can go NOW and test the new servers? As many told you, enwiki is there now and fixed.
:-)
[08:59:13] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3070156 (jcrespo) Open>Resolved a:jcrespo ``` root@labsdb1001[enwiki_p]> select ug_group from user_groups join user on user_id =...
[09:11:39] BTW, there's a diskspace warning in Icinga for db1047
[09:11:52] no, dbstore1002 I meant
[09:12:56] Yep, I will take care of that
[09:13:10] Saw the warning earlier, but got busy with something else
[09:13:12] will fix it now
[09:27:56] DBA, Operations, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3070174 (jcrespo)
[09:45:48] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070184 (jcrespo)
[09:45:51] DBA, Labs: Discrepancy between labsdb replicas of arwiki_p.user_groups - https://phabricator.wikimedia.org/T133469#3070182 (jcrespo) Open>Invalid This may have been true at some point, but I do not see this difference with the given query- labsdb1001 and labsdb1003 are identical, and all of them...
[09:49:12] DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3070201 (jcrespo)
[09:49:21] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070202 (jcrespo)
[09:49:24] DBA, Labs, Tool-Labs, wikitech.wikimedia.org: labswiki isn't replicated on Labs - https://phabricator.wikimedia.org/T89548#3070199 (jcrespo) Open>declined I do not think this is going to happen soon due to how special labswiki is- feel free to reopen (or open a new one with a feature requ...
[09:52:09] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070210 (jcrespo)
[09:52:12] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#3070203 (jcrespo) Open>stalled This is stalled- this is definitely going to happen, but we cannot find the time to do it. The scripts are done, we just need to puppetize...
[09:58:46] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070219 (jcrespo)
[10:05:51] I have some good news about GTID+multisource
[10:05:52] :)
[10:05:57] Let me confirm them first
[10:06:01] But so far, looking good
[10:06:13] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070267 (jcrespo)
[10:06:16] DBA, ContentTranslation, Labs, WorkType-NewFunctionality: Replicate ContentTranslation databases on Labs - https://phabricator.wikimedia.org/T119847#3070264 (jcrespo) Open>stalled We had been requested many times for x1 to be replicated to labs. This is very dangerous, as many tables cont...
[10:12:10] DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3070270 (Tpt)
[10:12:19] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070271 (Tpt)
[10:16:26] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3070277 (jcrespo)
[10:16:29] DBA, Labs: Wrong page title in labs database replica enwiki page table - https://phabricator.wikimedia.org/T136618#3070274 (jcrespo) Open>Resolved a:jcrespo Fixed: ``` labsdb1001[enwiki]> SELECT page_id, page_namespace, page_title FROM page where page_id IN (50274778,1272531,976991,50274777...
[10:17:59] DBA, Labs, Labs-Infrastructure: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3070279 (jcrespo)
[10:18:05] DBA, Labs, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#1118490 (jcrespo)
[10:18:08] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3070278 (jcrespo)
[10:24:13] I know why revision is taking so much time to analyze
[10:24:37] on db1051
[10:24:48] du -hcs revision* | tail -n 1 -> 454G total
[10:27:48] lovely
[10:28:20] I am thinking of moving that db to an ssd host, optimize it there and copy it back
[10:28:38] it is going to take less time than trying to do it in place
[10:28:43] that's not a bad idea, how many hours is that going now?
[10:28:55] I have good news about gtid+multisource, let me confirm them though :)
[10:28:58] I drop it
[10:29:07] the statement, I meant
[10:29:22] I was thinking of pooling it back for the weekend
[10:30:08] I always prefer to have stuff pooled back for the weekend indeed
[10:30:09] good news may be huge thing
[10:30:12] It makes me feel better
[10:40:05] I think Sean used to run analyze table on dbstores, and the copy the table-based statistics to the servers
[10:40:13] *then
[10:42:33] DBA, Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3070295 (Volans)
[10:44:49] the good news aren't good news, as it crashes when mysql is restarted
[10:44:55] gtid+multisource is really fragile :(
[10:45:02] at least I think I have gotten closer to the root cause
[10:47:20] DBA, Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3070302 (Marostegui) Let's get it replaced when @Cmjohnson has some time
[10:54:42] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3070309 (Marostegui) TL;DR = I found a way to enable GTID+multisource but it crashes when MySQL gets restarted which is a no-go for us still. However, I think this will help MariaDB to find th...
[11:04:19] db1051 started swapping when I started the analyze command
[11:04:27] yesterday?
[11:04:48] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=16&fullscreen&var-server=db1051&var-network=eth0&from=now-24h&to=now
[11:05:10] oh wow
[11:05:59] it probably wasn't very happy since then
[11:06:08] and explains why it was taking so much time
[11:06:44] well, a 400GB table I would have thought it would take a looong time
[11:06:52] but yes, swapping probably didn't help at all
[11:06:55] not for analyze
[11:07:26] it is just a bit more than a select + some calculations
[11:08:01] I am going to reimage and upgrade and all
[11:08:36] sure, sounds sane
[11:18:21] can I suggest a thought experiment? it is ok to say no!
[11:21:15] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3070335 (TTO) This column remains unavailable from Labs. For another similar task (T155605) @jcrespo suggested modifying [[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production...
[11:24:58] download 5.7.17, see how easy it is to do it there :-~
[11:26:48] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3070353 (Marostegui) The column is available in labs, what is not available is the view indeed. ``` mysql:root@localhost [enwiki]> select @@hostname; +------------+ | @@hostname | +--------...
[11:30:22] Blocked-on-schema-change, DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3070357 (Marostegui) >>! In T130067#3068382, @Addshore wrote: > As far as I am aware there would be no reason tha...
[11:45:19] DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3070374 (Marostegui) If by the end of the day the backup hasn't started yet I will run it manually.
There are no jobs from March finished just yet: ``` root@helium:/var/log/bacula# cat log.1 | g...
[11:56:17] DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3070401 (jcrespo) ```lang=bash root@helium:~$ echo "list jobs" | bconsole | head -n 10 Connecting to Director helium.eqiad.wmnet:9101 1000 OK: helium.eqiad.wmnet Version: 5.2.5 (26 January 2012) E...
[12:00:35] DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3070407 (Marostegui) Exactly, so for Feb they never took that much time so far: ``` root@helium:~# cal -m 2 February 2017 Su Mo Tu We Th Fr Sa 1 2 3 4 5 6 7 8 9 10 11 12 13 14...
[12:08:23] marostegui, https://phabricator.wikimedia.org/T159524
[12:08:44] I was reading it right now
[12:08:50] That is a bit mad, no?
[12:08:50] XD
[12:09:37] it is not preventing dbstore1001 from getting its backups done, no?
[12:09:53] going to grab lunch
[12:09:56] see my comments on the other channel
[12:15:50] there could be some issue with dbstore backups because old and new keys differ
[12:16:46] if that is the case, we may need to reset that job
[12:54:38] I thought about the key thing, but then I saw bconsole connecting fine to it (or it reported it connected fine)
[13:02:24] :-/
[13:11:30] DBA, Community-Tech, MediaWiki-Categories, Patch-For-Review: Increase size of categorylinks.cl_collation column - https://phabricator.wikimedia.org/T158724#3070517 (jcrespo) > As a temporary solution until some larger table refactoring takes place (which i assume we want to do all at once) - both...
[13:31:53] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3070574 (Marostegui) I have done dbstore2001 and dbs2067 already. (dbstore2002 doesn't have s6): dbstore2001: ``` frwiki ***************************...
[14:31:19] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3070692 (Marostegui) The only way I have found to get it enabled is as follows The main issue seems to come from the masters where there is no way to clean up the coordinates when they used to...
[15:16:36] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3070789 (Cmjohnson) The disk has been swapped. Rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: R...
[15:27:52] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3070820 (Marostegui) Thanks! I will keep an eye and close this ticket once it is all good!
[15:50:36] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3070867 (Marostegui) So for the last test I have tested with the same topology we have in production+sanitarium+labs That is: two masters -> sanitarium (with two replication threads) -> one...
[15:57:41] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3070874 (Marostegui) a:Marostegui
[15:58:22] DBA, Patch-For-Review: Remove partitions from enwiktionary.templatelinks in s2 - https://phabricator.wikimedia.org/T154097#3070875 (Marostegui) a:Marostegui>None
[15:59:09] DBA, Patch-For-Review: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079#2741270 (Marostegui) a:Marostegui>None
[16:02:23] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3070918 (Marostegui) Shall we go for s5 or s2 next?
I am not sure we can do both (because of disk space issues). s2 has more wikis, but s5 has wikidata :)
[16:03:37] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3070931 (jcrespo) s5 for me
[16:10:46] jynus: marostegui is it an issue to fit all 7 shards on the new setup? https://phabricator.wikimedia.org/T153743
[16:10:51] that's pretty concerning?
[16:11:19] no
[16:11:28] he means sanitarium host #2
[16:11:41] we have a sanitarium host #1
[16:11:50] plus we have asked for replacement hardware
[16:11:51] I'm with you ok
[16:12:11] yeah, sorry, that was for sanitarium #2 only
[16:12:38] current labsdb disk usage is 24%, with 9 free TB
[16:13:25] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=labsdb1009&var-network=eth0&from=now-60d&to=now
[16:13:59] all good, I just misunderstood :)
[16:14:11] however
[16:14:18] we were talking the other day
[16:14:33] and something we may have as a requirement for the future
[16:14:41] think, in 5 years,
[16:14:54] is to have separate servers for separate shards
[16:15:12] I do not think the current system will scale in 5 years
[16:15:20] I have no doubt
[16:15:31] and deprecating it now
[16:15:43] but allowing it for backwards compatibility
[16:15:43] that's more of a people problem I think afa tool devs and shared expectations
[16:15:45] yeah
[16:15:57] would be a good idea when the migration is done
[16:16:04] enforce it on ruling, but not technically
[16:16:12] give 5 years to accommodate it
[16:16:38] I think 5 years is more than enough advance notice :-)
[16:16:59] I am more worried about future purchases than anything else
[16:17:34] understood
[16:17:38] I've had similar thoughts
[16:17:53] at my last gig we sharded tables across for similar scale reasons
[16:18:08] and it was a pita but what else can you do, vertical scaling just flat out won't work long term
[16:18:15] yes, if people wanted like
[16:18:23] "all revision tables"
[16:18:36] or something along those lines in the same place that may be doable
[16:18:47] but we have to cut at some point
[16:19:03] we can also think of federation
[16:19:11] virtual tables that exist on other servers
[16:19:20] but that normally is really bad for performance
[16:25:13] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3071026 (Marostegui) >>! In T153743#3070931, @jcrespo wrote: > s5 for me Agreed. If that is the case, I believe db1070 can be a good option to be sanit...
[16:34:46] DBA, Patch-For-Review: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3071032 (Marostegui) a:Marostegui Today checksums: nowiki - dbstore1002 and db1047 have differences: ``` Differences on db1047 TABLE CHUNK CNT_DIFF CRC_DIFF CHU...
[17:22:12] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071211 (jcrespo) Open>Resolved All disks are fine, there are 2 with 1 media errors, and one with 2; but probably they will be ok for a while (one year).
[17:23:16] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071220 (Marostegui) All good - thanks Chris! ``` root@db1053:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level :...
[17:23:42] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3071225 (Marostegui) Good timing @jcrespo!
[17:46:40] marostegui, just one thing- I have merged 2 patches on operations/puppet/mariadb, but I do not want to deploy them to operations/puppet because it is too late, even if they should be noops
[17:46:56] I will do that on monday
[17:58:42] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3071469 (Cmjohnson) db1070 is under warranty for 2 more months. Requested new part from Dell Congratulations: Work Order SR944780612 was successfully submitted.
[18:02:24] DBA, Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3071472 (Cmjohnson) Disk has been replaced. rebuilding cmjohnson@db1056:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online,...
[18:06:21] DBA, Operations, ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3029364 (Cmjohnson) Replaced disk in slot 4. will wait for it to rebuild and then replace slot 7 Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware st...
[18:54:45] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3071598 (Marostegui) Excellent! Thank you!
[18:55:02] DBA, MediaWiki-Database, MediaWiki-Logging, Performance-Team, and 2 others: Logging needs an index to optimize searching by log_title - https://phabricator.wikimedia.org/T68961#3071603 (Umherirrender) >>! In T68961#3064637, @Huji wrote: > @Umherirrender may I asked what workaround you are talking...
[19:16:23] DBA, Operations, ops-eqiad: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3071684 (Marostegui) Thanks Chris! It will take a long time, so probably best to replace 7 on Monday :-) ``` root@db1060:~# megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL Rebuild Progress on Device at E...
[20:11:45] DBA, Labs: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572#3071865 (Marostegui)
[20:18:42] DBA, Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3071891 (Marostegui) Open>Resolved a:Marostegui This looks good now! Thanks ``` root@db1056:~# megacli -PDRbld -ShowProg -PhysDrv [32:1] -aALL Device(Encl-32 Slot-1) is not in rebuild process...
[20:19:46] DBA, Labs: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572#3071897 (Marostegui) Just to be clear, the server is UP, but I have left replication stopped so it doesn't crash the whole server when it comes to that transaction :-)
[20:24:28] DBA, Labs: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572#3071901 (chasemp) I honestly have no idea, I don't think I've had any contact with this setup. I hope @jynus or @yuvipanda has some wisdom. This seems super weird.
[20:42:26] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3071968 (MZMcBride) >>! In T159493#3070129, @jcrespo wrote: > @MZMcBride , why wait when you can go NOW and test the new servers? As many to...
[20:43:57] DBA, Labs, Labs-Infrastructure: Data integrity issue with enwiki_p user_groups on Wikimedia Tool Labs (missing rows) - https://phabricator.wikimedia.org/T159493#3071970 (MZMcBride) Related question: in scripts, I connect to `enwiki_p` for the database name and `enwiki.labsdb` for the database host na...