[06:18:38] DBA, Operations, ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3995318 (Marostegui) Open>Resolved All good now - thanks Papaul! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[06:47:43] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995363 (Marostegui)
[06:57:36] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995366 (Marostegui) Does the /srv/ still need fixing? I see it on fstab but I am not sure whether it is working or not. I have rebooted db2093 just to test...
[06:59:50] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995369 (Marostegui)
[07:00:26] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3892870 (Marostegui)
[07:01:55] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995373 (Marostegui) ``` root@db1115:~# dpkg -l | grep libodbc1 ii libodbc1:amd64 2.3.4-1 amd64 ODBC library for...
[07:02:04] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995374 (Marostegui)
[07:27:57] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995386 (Marostegui) I was thinking that it wouldn't hurt to copy all the content of db1011 to somewhere else just in case. Even db2093 or esXXXX or dbstore10...
[07:31:31] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3995389 (Marostegui)
[09:35:01] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995598 (jcrespo)
[09:36:24] db2037 doesn't seem to allow tendril connections
[09:40:09] I will check it, because it has (or should have) the same grants as db1009
[09:54:56] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995630 (Marostegui)
[09:56:37] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3892870 (Marostegui) Dropping a host works fine. Adding a host fails with: ``` ERROR 1105 (HY000) at line 338: (1045) Access denied for user 'watchdog'@'10.64...
[09:56:42] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995632 (jcrespo) Let me test some config changes first.
[09:57:09] ^ ok :)
[10:08:07] marostegui: think this
[10:08:16] db1115 wasn't on tendril
[10:08:21] now it is
[10:08:28] jynus: did you ruin mysql_upgrade on it?
[10:08:31] run
[10:08:50] yes
[10:08:55] no, I am answering Test adding/dropping host scripts
[10:08:58] cool, just discarding issues
[10:09:09] you mean db2037?
[10:09:17] I did not set up, you did
[10:09:19] I didn't touch it
[10:09:32] we are mixing things :)
[10:09:36] I added db1115 to the current host
[10:09:47] so adding and deleting hosts works
[10:09:53] ah cool
[10:09:55] what you are seeing is a problem with the host
[10:10:02] did you test it on db2037?
[10:10:13] or where?
[10:10:20] No, I was discarding issues
[10:10:30] Access denied for user 'watchdog'@'10.64.0.122' (using password: YES)
[10:10:32] And thought about dropping/adding it from tendril as a test
[10:10:35] where did you test it?
[10:10:38] db1115
[10:10:45] no, with what host
[10:11:12] I was removing and adding db2037 from db1115
[10:11:16] ok
[10:11:23] so your problem is db2037
[10:11:43] not tendril- aka some grants on db2037 are wrong
[10:11:47] yes, but as I said, I was discarding issues :)
[10:12:01] yes, no problem with that
[10:12:07] that is the weird thing that db2037 has the same grants as db1009
[10:12:11] I was talking about the bullet point
[10:12:19] [] Test adding/dropping host scripts
[10:12:23] that was already done
[10:12:27] ah cool, didn't know
[10:12:39] that is why I am telling you :-(
[10:12:41] :-)
[10:12:47] marked as fixed then!
[10:12:49] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995663 (Marostegui)
[10:13:15] and as a proof, I added db1115 and db2090
[10:13:17] maybe others
[10:13:32] cooool
[10:13:36] s5 could have wrong grants, or with the wrong users
[10:14:03] I created some with tendril user before realizing it usually was other being used
[10:14:12] but 1009 reports good on tendril
[10:14:14] so probably it is that
[10:14:19] and db2037 has the same ones (just did a diff)
[10:14:29] yeah, but probably 1009 was created with a different user
[10:14:54] yeah, but if db1009 is showing up fine, db2037 should too, no?
[10:14:58] I also checked iptables
[10:15:14] no, if you create them with another user
[10:15:31] check the user used for db1009, I bet it is a different one than most others
[10:19:49] it has a normal tendril user from what I am seeing now
[10:20:05] normal tendril user? XD
[10:20:24] It has a tendril user with the same hash as any other host
[10:20:30] I think you don't pay attention to processlist on 99% of the hosts
[10:21:01] go to one that is not m5 and run show processlist, you will see what I am saying :-)
[10:21:53] I was checking the watchdog user now
[10:22:04] look, this was my mistake, ok? I built m5 master badly, it was my first host
[10:22:22] I see it now
[10:22:41] I can see how it is confusing
[10:22:45] I created the confusion
[10:23:20] but we literally read the user like multiple times a day, so I asked to do so to see how silly we are :-)
[10:24:38] so my bet is a grant problem
[10:24:52] I just added the grant for watchdog@10.%
[10:24:54] which was missing there
[10:25:01] if it was me, I would drop db1005 too
[10:25:12] so we do not have any other tendril account
[10:25:34] well, let's wait to see if that fixes the issue XD
[10:25:44] he he
[10:26:08] hey, I apologize for the mistake, I think that was literally my first task
[10:26:28] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995701 (Marostegui) I have changed: https://wikitech.wikimedia.org/wiki/Tendril
[10:26:37] nothing to apologize for! it is messy
[10:27:10] actually, it is not db1005, it is db1009, right?
[10:27:15] yes, 1009
[10:27:51] 1009 had 'watchdog'@'10.64.0.15'
[10:27:53] my first task https://phabricator.wikimedia.org/T98958
[10:28:02] I even had to create my own tasks back then
[10:28:24] I think I started working on the 11 and that was created by me on the 13th
[10:28:42] However, 1009 doesn't have 'watchdog'@'10.%' so I wonder how it keeps having stats up-to-date on tendril
[10:28:54] you said it yourself
[10:29:02] the normal[sic] tendril user
[10:29:13] jesus...
[10:29:30] so it works, just with a different user
[10:29:53] if you had created the entry with the right[sic] user, it would have worked
[10:30:25] db2037 still not reporting, so I am starting to doubt if there is anything else from watchdog@10.% missing there
[10:30:41] did you recreate the host?
[10:30:49] no
[10:30:54] you must
[10:30:55] do I have to?
[10:31:15] ok, done now
[10:31:17] now imagine solving all of that for yourself
[10:31:23] why do I have to recreate it?
[10:31:33] not recreate it
[10:31:39] more like drop it and create it again
[10:31:44] yeah yeah, I did that
[10:31:48] But why is that necessary?
[10:31:55] I can tell you, and will
[10:31:59] (now it works)
[10:32:01] I am just commenting that
[10:32:10] I did not create tendril, ok?
[10:32:27] https://media.giphy.com/media/d2lcHJTG5Tscg/giphy.gif
[10:32:28] so I am telling you just facts, without agreeing with them, ok?
[10:32:48] user is stored on the tendril database
[10:32:57] and used to poll the hosts
[10:33:16] the creation fails
[10:33:30] because it creates connect tables
[10:33:51] and as far as I know, connect tables have to work on creation
[10:33:58] so tables fail and data is not gathered
[10:34:24] buffff
[10:35:20] now, that I had to discover mostly by myself
[10:35:28] how long did that take? XD
[10:35:43] so of course I created stuff with a different user, but not on purpose
[10:35:52] it just worked
[10:36:32] only 2 days ago I understood how tendril works in detail
[10:36:55] and I think it was a genius idea, given the constraints
[10:36:56] hahaha
[10:37:12] one person without help and without other monitoring tools
[10:37:16] you do what you can
[10:37:21] yeah, it works greatly, but that logic is hard to discover, it is like reverse engineering
[10:37:38] I love tendril yes
[10:37:41] it is not "like", it is reverse engineering
[10:37:59] now, the question is, is it still scaling to the current sizes?
[10:38:06] is it compatible with performance schema?
[10:38:16] does it overlap with prometheus?
[10:38:39] is connect a reliable engine moving forward?
[10:38:50] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3995753 (Marostegui)
[10:39:40] so db2037 seems to be working again
[10:39:51] yeah, straight away after drop/add
[10:40:09] do you want to recreate db1009 with the new user or not worth it?
[10:40:32] yeah, let's leave it in a good state just in case
[10:40:33] we should check for other instances of tendril user on other hosts
[10:45:00] db1009 is now "clean"
[10:46:07] I honestly don't know how much time passed between my first host and when I used a different username
[10:46:16] there could be more with the other user
[10:46:37] also drop the other user so it no longer exists on that replica set and it is not recreated accidentally, if you can
[10:47:06] yeah, it is dropped on 1009 and db2037
[10:47:10] I am quite sure it is not used for anything else
[10:47:28] can you run a query to see if it exists or it is in use somewhere else?
[10:47:53] vs the other user
[10:48:04] haha I was just doing that :)
[10:50:40] looks like there are no more hosts with that user
[10:51:26] cool
[10:54:49] how long does the tree take to get updated in tendril?: https://tendril.wikimedia.org/tree (check m5)
[10:56:03] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994420 (Marostegui) We normally wait till all the patches are merged. Another question, you said you can handle this on your ow...
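The account audit discussed between 10:40 and 10:50 (find hosts that still carry the stray tendril account, and hosts that lack the watchdog@10.% grant) could be sketched roughly as below. This is only an illustration: the function names and the inventory structure are hypothetical, the sample data is made up rather than the real state of these hosts, and in practice the inventory would have to be collected from each host's account list first.

```python
def hosts_with_account(accounts_by_host, user):
    """Return the sorted hosts that define at least one account named `user`."""
    return sorted(
        host
        for host, accounts in accounts_by_host.items()
        if any(name == user for name, _pattern in accounts)
    )


def hosts_missing_grant(accounts_by_host, user, pattern):
    """Return the sorted hosts that lack the exact account user@pattern."""
    return sorted(
        host
        for host, accounts in accounts_by_host.items()
        if (user, pattern) not in accounts
    )


# Made-up sample inventory: host -> set of (user, host pattern) accounts.
# The host patterns for the tendril account are an assumption.
SAMPLE = {
    "db1009": {("tendril", "10.%"), ("watchdog", "10.64.0.15")},
    "db2037": {("tendril", "10.%")},
    "db1115": {("watchdog", "10.%")},
}

stray = hosts_with_account(SAMPLE, "tendril")
ungranted = hosts_missing_grant(SAMPLE, "watchdog", "10.%")
```

With this sample, both `stray` and `ungranted` list db1009 and db2037, which are the two hosts that were dropped and re-added in the conversation.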
[11:04:04] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3994420 (jcrespo) > Do you have ALTER privileges All deployers have alter privileges, but they shouldn't use them except in an...
[11:06:10] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3995848 (jcrespo) Note, all the above is for the addition of the column, which is simple. The index changes need additional scru...
[11:32:32] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3995921 (Marostegui)
[11:33:29] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3995922 (jcrespo)
[11:34:37] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3940891 (jcrespo) This is for eqiad only, we can think in the future, with much much less priority, if we want to do that with codfw- it...
[11:35:21] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3995925 (Marostegui) >>! In T186321#3995922, @jcrespo wrote: > This is for eqiad only, we can think in the future, with much much less p...
[11:42:01] will be deployed at 15:00 UTC I think https://gerrit.wikimedia.org/r/413375
[11:43:16] i guess it will be easy to realise if something breaks before the weekend
[11:43:27] I don't think anything will break though
[11:45:44] it is only ferm changes
[11:46:00] of labs mainly
[11:46:06] I prefer labs people to be around
[11:46:17] yes I agree :)
[11:47:24] should I test tendril changes one at a time?
[11:48:19] you mean breaking that patch down in pieces?
[11:48:54] I don't know
[11:49:06] I think it was the Aria tables
[11:49:17] but on the other side, I do not want to reboot it multiple times
[11:49:27] Ah you are not talking about the ferm patch anymore XDD
[11:49:37] no, sorry, my fault
[11:49:46] haha np!
[11:50:06] so, you mean tendril config changes?
[11:50:13] yes
[11:50:19] now, yes
[11:50:34] I would try to see if it is Aria indeed or not
[11:50:40] to confirm it
[11:50:55] I can revert everything except Aria
[11:51:00] I am almost sure it was that
[11:51:07] and only needs one restart
[11:51:10] let's try that
[11:51:21] there is another unrelated issue right now
[11:51:33] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1115&var-port=9104
[11:51:43] ^that should work, right?
[11:52:03] maybe grant issue or prometheus configuration?
[11:52:57] yeah that should work....
[11:53:00] let me see the grants
[11:54:19] if it is grants, that is why I wanted to wait before cloning it
[11:55:40] curl works, so either you fixed it or it wasn't that
[11:56:47] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1 doesn't complain about the hosts, so probably missing config on puppet
[11:57:13] see how it complains about db1011
[11:57:13] No, I was checking grants and they look in place
[11:59:05] oh wait
[11:59:32] ?
[11:59:38] it is not in modules/role/files/prometheus/mysql-misc_eqiad.yaml
[11:59:41] and db1011 is
[11:59:53] add db2093 too
[11:59:57] even if it fails now
[11:59:57] oki
[11:59:59] doing it now
[12:00:09] we will have something running there, what exactly we don't know
[12:01:17] this is the memory usage of db1115 https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=86&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=db1115
[12:06:37] this is a draft proposal: https://gerrit.wikimedia.org/r/413712 but I would need grafana to be more accurate
[12:07:12] we'll see if aria+binlog works good
[12:07:25] oh, and there is the standalone, which needs fixing
[12:07:28] if aria fixes the issue we might be able to have binlog enabled
[12:08:44] we need to change those files (prometheus) to be generated automatically
[12:10:48] 400 iops
[12:10:55] almost no disk reads
[12:11:15] lots of bytes written, which will likely double with binlogs
[12:11:46] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1115&var-port=9104&from=now-5m&to=now
[12:12:06] \o/
[12:12:18] maybe we can increase the buffer pool more
[12:12:27] I am not sure about toku changes
[12:12:37] toku normally is more heavy on cache access
[12:13:10] the question is how much resources we want to take, also thinking in the future (small prometheus installation)
[12:34:25] I may deploy the updated patch after lunch
[12:35:21] sure
[13:10:17] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3996140 (Marostegui) For s4 my suggestion is db1081. Reasons: the only non large server is db1064 which is sanitarium master so that is...
[16:08:49] marostegui: a wild silver appears on tendril! it is super-effective!
[16:09:31] ah!
[16:09:31] haha
[16:09:33] I see it now
[16:13:49] the performance seems probably even better after the restart
[16:13:59] so replication probably possible
[16:14:56] \o/ nice!
[16:14:57] that was my only blocker, feel free to clone it or whatever, but I am definitely not going to do it today
[16:15:14] No, no way I am doing it today either
[16:15:25] :-)
[16:15:31] binlog is enabled then, right?
[16:15:32] I might do it on monday morning
[16:15:34] yep
[16:15:49] I do not expect you to do it
[16:15:57] just I didn't want you to do it today
[16:16:02] haha yeah
[16:16:13] because config was dubious still
[16:16:14] But I start early, so maybe by 9am-10am it can be already in codfw :)
[16:16:25] it is clearly an aria issue
[16:16:37] maybe it gets used for temporary tables?
[16:28:40] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3996599 (jcrespo)
[16:44:25] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#3996671 (Marostegui) We could copy db1011's content to db1113 or db1114 which right now are not used and are just spares.
[17:32:57] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema changes - https://phabricator.wikimedia.org/T188048#3996884 (Tgr) >>! In T188048#3995848, @jcrespo wrote: > Note, all the above is for the addition of the column, which is simple....
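The 11:59 finding (db1115 was missing from modules/role/files/prometheus/mysql-misc_eqiad.yaml while db1011 was listed) and the 12:08 wish to generate those files automatically suggest a check along the lines below. It is a rough sketch only: it assumes the file holds a list of groups with `targets` entries of the form `host:port`, as in Prometheus file-based service discovery, and it takes port 9104 from the Grafana URLs in the log. The function names and the label values are hypothetical, and the real puppet-managed file may be laid out differently.

```python
def missing_targets(expected_hosts, target_groups, port=9104):
    """Return the sorted expected hosts that appear in no target group."""
    present = {
        target
        for group in target_groups
        for target in group.get("targets", [])
    }
    return sorted(h for h in expected_hosts if f"{h}:{port}" not in present)


def with_targets_added(target_groups, hosts, labels, port=9104):
    """Return a new list with one extra group holding the given hosts.

    The input list is left untouched so the result can be diffed against it.
    """
    new_group = {
        "labels": dict(labels),
        "targets": [f"{h}:{port}" for h in sorted(hosts)],
    }
    return list(target_groups) + [new_group]


# Made-up example; the label name and value are placeholders.
groups = [{"labels": {"shard": "example"}, "targets": ["db1011:9104"]}]
expected = ["db1011", "db1115", "db2093"]

absent = missing_targets(expected, groups)
updated = with_targets_added(groups, absent, {"shard": "example"})
```

Here `absent` comes out as db1115 and db2093, the two hosts added during the conversation, and running `missing_targets` again on `updated` returns an empty list.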
[17:35:01] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema change for efficient count(*) handling - https://phabricator.wikimedia.org/T188048#3996886 (Tgr)
[17:35:33] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema change for efficient count(*) handling - https://phabricator.wikimedia.org/T188048#3994420 (Tgr)
[17:36:44] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema change for efficient count(*) handling - https://phabricator.wikimedia.org/T188048#3996894 (jcrespo) Note I am not saying you should undeploy and redeploy, I was just justifying th...
[17:45:22] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3996945 (jcrespo)
[17:45:27] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3996946 (jcrespo)
[17:45:32] DBA: Decommission db1011 - https://phabricator.wikimedia.org/T184703#3996944 (jcrespo)
[17:46:27] Blocked-on-schema-change, Reading List Service, Reading-Infrastructure-Team-Backlog (Kanban): Deploy ReadingLists schema change for efficient count(*) handling - https://phabricator.wikimedia.org/T188048#3996949 (Tgr) The other task is {T188120}; I'm not adding the schema change tag yet, if it cannot...
[17:49:22] DBA, Operations, hardware-requests, ops-eqiad: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3996962 (jcrespo) a:jcrespo>RobH
[17:50:01] DBA, Operations, hardware-requests, ops-eqiad: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3978426 (jcrespo)
[17:52:29] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3996978 (jcrespo) a:RobH
[23:17:26] DBA, CheckUser, MediaWiki-Special-pages: Investigation: Add old and new length columns to cu_changes - https://phabricator.wikimedia.org/T155734#3998126 (TBolliger)
[23:30:11] DBA, MediaWiki-Categories, Patch-For-Review: Increase size of categorylinks.cl_collation column - https://phabricator.wikimedia.org/T158724#3998166 (TBolliger)