[05:25:27] 10DBA: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4156169 (10Marostegui) p:05Triage>03Normal
[05:26:15] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4156182 (10Marostegui)
[05:27:45] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (10Marostegui)
[05:35:14] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4156186 (10Marostegui)
[05:36:02] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (10Marostegui) @Cmjohnson I have confirmed that all the hosts with the exception of db1120 as you mentioned, are up and ready - let's keep this opened till it is fixed. T...
[06:35:40] ERROR 1799 (HY000): Creating index 'PRIMARY' required more than 'innodb_online_alter_log_max_size' bytes of modification log. Please try again.
[06:35:57] Ah yeah, that is a known issue to me too
[07:34:42] there is prometheus running for all sections on db1115
[07:34:51] *db1116
[07:35:01] should I kill a few?
[07:35:19] Oh, I will take care of it
[07:35:23] thanks for letting me know!
[07:35:50] when removing a section, you need to delete files in several locations
[07:36:08] it is not done automatically
[07:37:15] I guess the prometheus exporters?
[07:37:49] also for mysql
[07:37:59] yeah, I removed those already
[07:38:03] mostly config on etc
[07:39:04] /etc/default/prometheus-mysqld-exporter*
[07:39:13] yeah, just did those :)
[07:40:07] I mention it because systemctl was alerting
[07:40:13] yeah yeah
[07:40:16] I am glad you did
[07:40:45] I can add the ensure => absent
[07:40:55] but I didn't want to, in case of accidental deletion
[07:41:48] I can set up another host in core if you are ok with that
[07:41:51] yeah, I think it is better not to
[07:41:54] jynus: sure!
[07:41:57] Let me update the new ticket
[07:42:33] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4156316 (10Marostegui)
[07:45:07] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4156319 (10jcrespo)
[08:15:29] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4156401 (10jcrespo)
[08:21:49] 10DBA, 10Operations-Software-Development: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875#4156409 (10Marostegui) I am fine placing this on m2 as Jaime originally suggested.
[08:22:55] I am choosing s2 because it has 2 "old" servers, and db1122 because it is in row D, of which there is none at s2
[08:23:16] So we will be able to free up the vslow host there and move it to misc?
[08:23:46] Maybe you can also use it for s4, which has the other old host
[08:24:01] not sure if I will place it as multiinstance
[08:24:26] Why not?
[08:26:18] there are too many D hosts on s4
[08:26:51] maybe for s7 then?
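
The ERROR 1799 quoted at 06:35:40 means the online ALTER outgrew the temporary log InnoDB keeps for concurrent changes during an in-place index build. A minimal sketch of the usual workaround, assuming a session with SUPER privileges; the 4 GiB value and the table/column names are illustrative, not what was actually used here:

    -- Raise the cap on the online-DDL modification log (dynamic variable,
    -- so no restart is needed); 4 GiB here is an illustrative value.
    SET GLOBAL innodb_online_alter_log_max_size = 4 * 1024 * 1024 * 1024;

    -- Retry the rebuild; with ALGORITHM=INPLACE concurrent writes are again
    -- buffered in the (now larger) log instead of failing the ALTER.
    ALTER TABLE some_table ADD PRIMARY KEY (some_id), ALGORITHM=INPLACE, LOCK=NONE;

The alternative is to run the ALTER without online DDL (ALGORITHM=COPY or with writes blocked), which avoids the modification log entirely at the cost of locking.
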
[08:27:01] I prefer to convert an existing s2 host from C
[08:27:09] ah sure, that works too
[08:27:55] it is a triple move, but I think it will be a nicer setup
[08:28:17] yeah
[08:28:19] 22 substitutes 90 which substitutes db1060
[08:28:28] haha
[08:28:53] yeah, makes sense
[08:28:54] I am not sure about that either, maybe s4 needs dedicated vslows
[08:29:10] right now it is a "slow" host
[08:29:13] So I am going to pool it, remove old hosts
[08:29:23] So probably the SSDs will compensate for the smaller buffer pool
[08:29:25] and we can rearrange later
[08:29:33] sure
[08:32:08] leaving it replicating for some days, etc.
[08:32:39] yeah
[08:33:19] and with the dump replicas arriving later, I am not sure multi-instance is the way, at least for all sections
[08:34:00] and thinking too much ahead... I just want to remove the old servers for now
[08:34:07] sure, let's stick to that
[08:34:37] and fixing the row gaps
[08:39:26] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4156461 (10Marostegui)
[09:01:32] 10DBA, 10Dumps-Generation: Some dump hosts are accessing main traffic servers - https://phabricator.wikimedia.org/T143870#4156551 (10jcrespo) `dumpsgen 4973 /usr/bin/php5 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=zhwiki` is accessing a main traffic (no `dump`) server at snapshot1007, and...
[09:32:07] 10DBA, 10Operations-Software-Development: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875#4156668 (10jcrespo) @Volans you can speed up the process by setting some password on the private repository (and some non-private equivalent in the labs/private one), and suggesting...
[09:56:08] should we convert the remaining enwiki myisam table to innodb?
[09:56:22] there are still myisam tables there?
[09:56:44] 1
[09:56:50] which one is that?
[09:56:55] bv_editsSOMEYEAR
[09:57:06] :|
[09:57:23] I guess it is fine to convert them (I wonder if they are in use)
[09:57:36] They are not
[09:57:41] per T54921
[09:57:42] T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921
[09:57:43] maybe it is scheduled for deletion?
[09:57:47] yep
[09:57:48] they are
[09:57:57] 2009, 2011 and 2015
[09:58:02] (don't know if there are more)
[09:58:27] I think they are
[09:59:53] On that task, only 2009, 2011 and 2015 appear to be scheduled for deletion
[10:00:20] when handled, we can create a task with all the existing ones and ask again just in case
[10:04:54] Amir1: the logging table can be optimized already in s8? I was thinking that I have to do an alter table on that table, which is currently 430G, and I was wondering if it can be optimized already so we can free up some space there
[10:16:47] marostegui: the table is half of its size now
[10:17:02] but the plan is to reduce it to 1% of its size
[10:17:23] so 300M more rows to delete but 360M already deleted
[10:17:28] as you wish :D
[10:17:29] we have a purging issue that I don't like
[10:17:51] Amir1: Then I will wait and proceed with other shards before I hit s8 then :)
[10:17:56] When do you think it will be ready for it?
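
For the remaining MyISAM tables discussed around 09:56, one way to confirm what is left and convert it. A hedged sketch only: the query just reads the data dictionary, and 'bv_edits_2009' is an illustrative table name, since the log only refers to them as bv_editsSOMEYEAR.

    -- List whatever is still MyISAM on enwiki, straight from the data dictionary.
    SELECT table_name, table_rows, engine
    FROM information_schema.tables
    WHERE table_schema = 'enwiki'
      AND engine = 'MyISAM';

    -- Convert one of them to InnoDB (a full table rewrite, so size and IO matter).
    ALTER TABLE enwiki.bv_edits_2009 ENGINE = InnoDB;
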
[10:18:45] we may want to clean up dbstore2001 in advance, marostegui: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13318
[10:19:03] uf
[10:19:03] marostegui: it needs jynus to approve because he said I need to stop deleting
[10:19:22] or rebuild it or something
[10:19:26] or it will explode
[10:19:44] it may not be the only server
[10:20:04] I can make it stop by increasing the purge threads, but then it starts lagging
[10:20:26] yeah
[10:20:41] it is madness
[10:20:43] I can try again, now that backups should be complete
[10:20:53] They are still running on 2001
[10:21:11] maybe wikibase has a leak issue
[10:21:13] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13318&panelId=11&fullscreen&from=now-30d&to=now
[10:21:15] with too many deletes?
[10:21:17] scary
[10:24:35] maybe we can just rebuild s8 there
[10:24:59] let me try rebuilding logging once
[10:25:06] we have really nothing to lose
[10:25:12] rebuilding the table you mean?
[10:25:17] yes
[10:25:23] sounds good
[10:25:24] and/or increasing again the purge threads
[10:25:36] I wanted to run an alter on s2 in codfw - should I hold?
[10:25:48] no problem
[10:25:52] this is s8
[10:25:59] if it goes slow, it goes slow
[10:26:04] sure, but I didn't want to add more IO there :)
[10:26:44] I'll wait for the s2 backups to be done
[10:27:55] I've started an ALTER TABLE logging ENGINE=InnoDB row_format=Compressed, FORCE;
[10:28:22] cool
[10:28:31] -rw-rw---- 1 mysql mysql 217G Apr 25 10:20 logging.ibd
[10:28:37] let's see how much it shrinks
[10:28:41] I will stop replication
[10:28:41] Amir1: ^
[10:29:55] thanks
[10:29:57] but if it keeps happening it may not be logging but other tables
[10:30:07] I'm not deleting anything on s8 for now
[10:30:28] deletes are very low according to stats, too
[10:30:32] so it is quite strange
[10:30:47] that is why I didn't want to add more issues at the moment, Amir1
[10:31:26] maybe the compression + low io makes it very laggy, still
[10:31:47] I trust you fully, if you say stop, there is no need to say more :)
[10:32:17] Amir1: the thing is, if it continues in that trend, things end up exploding (crashes)
[10:32:44] jynus: marostegui I'm deleting logging from dewiki (probably already done) and commons (this is pretty slow but ongoing) - do you want me to stop them too?
[10:33:03] no, those caused no issues that I can see
[10:33:07] but let me double-check
[10:33:41] s4 is fine so far
[10:33:43] I have deleted a couple of million rows on all wikis
[10:34:03] all small and medium are already done, half of large wikis as well
[10:34:11] but it is only s8 with issues
[10:34:18] all others are ok
[10:34:32] noted
[10:35:03] could be unrelated and be a host-only issue, but we need time to solve it
[10:35:16] will ping you on phabricator when we are back in a good state
[10:36:51] is there a phabricator task for this issue?
[10:37:29] no, I will ping you in the "delete logging rows" one
[10:38:36] marostegui: I got 2 more duplicate errors while compressing dbstore1001
[10:38:44] ERROR 1062 (23000) at line 1: Duplicate entry '1047393233' for key 'ts_rc_id'
[10:38:50] ERROR 1062 (23000) at line 1: Duplicate entry 'Q30325899-T-44294723' for key 'eu_entity_id'
[10:39:06] can you try those again with replication enabled?
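
The dbstore2001 graphs linked above are about InnoDB purge falling behind the deletes on s8. A small sketch of how that backlog (the history list length) can be watched from SQL, assuming a MariaDB/MySQL build that ships information_schema.INNODB_METRICS and has the trx_rseg_history_len counter enabled:

    -- History list length: how many undo records the purge threads still
    -- have to clean up; a steadily growing value matches the graphs above.
    SELECT name, `count`
    FROM information_schema.innodb_metrics
    WHERE name = 'trx_rseg_history_len';

    -- The same figure is printed as "History list length" in the TRANSACTIONS
    -- section of SHOW ENGINE INNODB STATUS if the metrics table is unavailable.
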
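To see how much the compressed rebuild of the 217G logging.ibd actually shrinks the table, the data dictionary gives a rough figure. A sketch assuming s8 here is the wikidatawiki schema (my assumption; the log does not name it), keeping in mind that information_schema sizes are estimates and lag the .ibd file on disk:

    -- Approximate on-disk footprint and row format of the logging table,
    -- useful before and after the ROW_FORMAT=COMPRESSED rebuild.
    SELECT table_name,
           row_format,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
    FROM information_schema.tables
    WHERE table_schema = 'wikidatawiki'
      AND table_name = 'logging';
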
[10:39:16] doing right now
[10:39:22] I will need to compress db1116.s1 so we will see if I get duplicates too
[10:39:36] tag_summary and wbc_entity_usage
[10:39:53] both friends of ours
[10:40:37] both with likely ongoing replication activity
[10:43:39] "sqldata/zhwiki/revision_broken.ibd"
[10:44:06] ?
[10:44:24] lots of garbage on the servers
[10:44:31] (I guess)
[10:45:11] never seen that before
[11:06:30] the tag_summary error happened again
[11:06:41] the other hasn't finished yet
[11:06:57] the key that failed is different, though
[11:07:08] I think this is an alter table issue
[11:33:12] https://tendril.wikimedia.org/host/view/db1122.eqiad.wmnet/3306
[11:33:20] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1122&var-port=9104
[11:35:32] I will reimage db1090 after lunch
[12:47:08] tag_summary fails every time, each time on a different id
[12:48:00] https://phabricator.wikimedia.org/P7038
[12:49:08] I am going to try once again with replication stopped
[12:49:31] pffff
[12:49:41] maybe it has too much activity?
[12:50:06] if it has too much, then it is a "normal" error, and it complains about the buffer
[12:50:30] I think this is a bug
[12:50:39] :(
[12:51:04] in the past I would have said it is old data types or something
[12:51:20] but this is a newly imported table, so as to start from 10.1 directly
[12:51:52] maybe it happens again with replication off
[12:51:56] let's see
[12:51:58] we'll see
[12:53:47] I think I will move all db1060's roles to db1090
[12:53:52] the decom db1060
[12:53:55] *then
[12:54:15] cool
[13:00:12] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4157419 (10jcrespo)
[13:21:55] it worked first try with replication stopped
[13:22:28] is it worth reporting?
[13:23:07] I think yes, the problem is to make it reproducible
[13:23:12] Yeah, that's the thing
[13:57:35] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4157635 (10Marostegui)
[14:04:17] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4157672 (10jcrespo)
[15:29:14] @marostegui @jynus db1120 is doing the initial puppet run; after that, all 8 of the new db's are ready for you.
[15:30:10] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158048 (10Cmjohnson)
[15:30:16] thank you very much chris!
[15:31:18] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (10Cmjohnson) a:05Cmjohnson>03Marostegui @marostegui db1120 is fixed, i had the ethernet cable in the wrong port :-(. Assigning to you and removing ops-eqiad tag
[15:31:24] YW...resolve that task if you no longer need it
[15:33:14] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158081 (10Marostegui) 05Open>03Resolved a:05Marostegui>03Cmjohnson Confirmed db1120 looks good! Thanks @Cmjohnson!
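
The duplicate-key failures on tag_summary and wbc_entity_usage only went away once the rebuild was retried with replication stopped (13:21:55). A hedged sketch of that sequence on a replica; the exact statement used is not in the log, so the ROW_FORMAT=COMPRESSED rebuild below mirrors the logging ALTER quoted earlier rather than what was actually typed:

    -- Pause replication so no concurrent replicated writes hit the table
    -- during the rebuild, then resume afterwards.
    STOP SLAVE;
    ALTER TABLE tag_summary ENGINE=InnoDB ROW_FORMAT=COMPRESSED, FORCE;
    START SLAVE;
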
[15:33:28] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158086 (10Marostegui)
[15:33:49] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4158088 (10Marostegui)
[18:33:09] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare storage layer for euwikisource - https://phabricator.wikimedia.org/T189466#4158969 (10Bstorm) The wiki replica views and DNS are set.
[18:33:42] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4158971 (10Bstorm) The wiki replica views and DNS are set. -- and updated that document so it is now correct.
[18:34:04] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4158972 (10Bstorm) The wiki replica views and DNS are set.
[18:34:21] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4158973 (10Bstorm) The wiki replica views and DNS are set.
[18:34:53] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4158976 (10Bstorm) The wiki replica views and DNS are set, and corrected mentioned doc.
[18:35:12] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4158978 (10Bstorm) The wiki replica views and DNS are set.
[18:56:03] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare storage layer for euwikisource - https://phabricator.wikimedia.org/T189466#4159072 (10Bstorm) 05Open>03Resolved
[18:56:33] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4159074 (10Bstorm) 05Open>03Resolved
[18:56:58] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4159076 (10Bstorm) 05Open>03Resolved
[18:57:09] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4159090 (10Bstorm) 05Open>03Resolved
[18:57:47] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4159092 (10Bstorm) 05Open>03Resolved
[18:58:09] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4159095 (10Bstorm) 05Open>03Resolved