[03:54:43] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (Nihlus)
[05:59:49] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (Marostegui) Stalled→Open When do you want to do this?
[06:01:06] DBA, Wikimedia-Site-requests, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, User-MarcoAurelio: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107 (Marostegui) a:MarcoAurelio→Tgr >>! In T215107#4985384, @Tgr wrot...
[06:01:14] DBA, Wikimedia-Site-requests, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, User-MarcoAurelio: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107 (Marostegui) Open→Resolved
[06:01:32] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (Marostegui) p:Triage→Normal When do you want to do this?
[06:25:20] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui) db1119 has been altered, only db1089 left on s1. I have not seen any new slow queries popping up for enwiki hosts on tendril, in fact, I am seei...
[06:49:52] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui)
[07:18:09] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui) a:Marostegui
[07:56:36] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (alanajjar) > When do you want to do this? if you are here @Marostegui we can do it (then T216444)
[07:57:06] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (Marostegui) Go for it, please share the progress URL once you've started it
[08:01:42] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (alanajjar) >>! In T217222#4987345, @Marostegui wrote: > Go for it, please share the progress URL once you've started it Thanks @Marostegui- [[https://meta.wikimedia.org/wik...
[08:08:55] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (alanajjar) Open→Resolved a:Nihlus→alanajjar
[08:09:00] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (alanajjar) @Marostegui here now?
[08:09:23] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (Marostegui) Go for it
[08:09:27] DBA, Wikimedia-Site-requests: Global rename of HeavyTony → QTHCCAN: supervision needed - https://phabricator.wikimedia.org/T217222 (alanajjar) Thanks @Marostegui and @Nihlus
[08:11:28] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (alanajjar) [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Takhirgeran_Umar|Progress url]]
[08:25:38] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (alanajjar) Open→Resolved a:Nihlus→alanajjar
[08:26:01] DBA, Wikimedia-Site-requests: Global rename of Дагиров Умар → Takhirgeran Umar: supervision needed - https://phabricator.wikimedia.org/T216444 (alanajjar) Thanks @Marostegui and @Nihlus
[09:48:55] I am going to upgrade db1124 (one of the sanitariums) to 10.1.38 either today or tomorrow, any objection?
[10:14:03] well, no objections, although I wanted to do labsdb easier first
[10:14:12] *earlier
[10:14:22] but if it is easier for you, no issue
[10:14:30] we did all the sanitarium masters before the sanitariums
[10:14:39] yeah, that's true
[10:14:40] so I wanted to go down as we already did that (for whatever reason)
[10:14:53] to avoid having 10.1.38 -> 10.1.37 -> 10.1.38
[10:22:28] +1
[10:52:45] marostegui: ok I may need some ideas, as I am scratching my head https://phabricator.wikimedia.org/P8135
[10:53:04] let's see
[10:53:23] I think we have seen that issue before
[10:53:29] ?
[10:54:19] that message is sent on a pymysql.err.OperationalError, pymysql.err.InternalError
[10:55:02] I recall a similar issue with db1115 months ago (where it got full of Unknowns on icinga) despite being able to connect to it manually
[10:55:16] And I think we narrowed it down to a sudden spike on mysql
[10:55:17] (or load)
[10:55:32] but it is happening right now
[10:55:54] So it is receiving access denied
[10:56:06] recvfrom(3, "4\0\0\2\377\242\6#28000Access denied for user 'nagios'@'localhost'", 8192, 0, NULL, NULL) = 56
[10:56:19] which I believe is exactly the same thing we saw the last time
[10:56:42] but I can connect as nagios@localhost on command line
[10:58:19] maybe it is not connecting as the nagios unix user?
[10:58:26] it is
[10:58:31] recvfrom(3, "4\0\0\2\377\242\6#28000Access denied for user 'nagios'@'localhost'", 8192, 0, NULL, NULL) = 56
[10:59:55] and it is using the right socket: connect(3, {sa_family=AF_UNIX, sun_path="/run/mysqld/mysqld.sock"}, 25) = 0
[11:00:11] mmm
[11:00:26] I think it is not using the socket
[11:00:46] recvfrom(3, "Y\0\0\0\n5.5.5-10.1.37-MariaDB\0\201\27\305\21]Vul=-hD\0\377\377\10\2\0?\340\25\0\0\0\0\0\0\0\0\0\0Zxxxxxxx+\0mysql_native_password\0", 8192, 0, NULL, NULL) = 93
[11:00:55] sendto(3, "G\0\0\1\r\242+\0\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0nagios\0\0zarcillo\0mysql_native_password\0", 75, 0, NULL, 0) = 75
[11:02:40] so my suggestion is that it is connecting as nagios, but maybe the unix user is not nagios anymore because of some puppet or package update
[11:04:59] so it doesn't use socket based authentication
[11:05:11] for that or other reason
[11:05:32] because AFAICS socket works
[11:05:36] yeah
[11:05:43] but it looks like it is using password
[11:05:56] we can try to enable password auth and see if it recovers, to confirm
[11:06:09] unix_socket | ACTIVE
[11:06:31] No, I mean it looks like, from strace, that it is trying to connect using password
[11:06:38] not too worried about this, we
will see what it is eventually, but if it is something related to puppet or update
[11:06:44] it will do a page storm
[11:06:51] on the production hosts
[11:07:10] but there have been no changes as far as I know, right?
[11:07:24] marostegui: and I am guessing you didn't touch the server of the code, right?
[11:07:26] he
[11:07:39] uh?
[11:07:48] the server of the code?
[11:07:57] server *or the code
[11:08:09] I didn't change anything nope :)
[11:08:18] and certainly not yesterday
[11:08:57] let me see if some package was updated yesterday
[11:09:40] intel-microcode on 25, so nope
[11:10:20] this is buffling isn't it?
[11:10:34] do we have a project to report issues with replicas?
[11:10:34] *baffling
[11:10:47] hauskatze: which kind of issue?
[11:10:50] to tag: T217235
[11:10:51] T217235: Problem with dewiki_p - https://phabricator.wikimedia.org/T217235
[11:10:53] hauskatze: we used to, but it was archived and we stopped having one
[11:11:05] That should be fixed already
[11:11:06] I think
[11:11:07] Let me check
[11:11:08] ah, replicas, not replication
[11:11:23] hauskatze: but the project is data-services
[11:12:05] "After a few minutes the problem vanished." probably someone was doing maintenance
[11:12:06] I should add an alias for that project so 'replicas' is identified as data-services
[11:12:27] wikireplicas ok, but replicas is confusing
[11:12:39] there are replicas somewhere else than the wikireplicas
[11:13:07] wikireplicas added then https://phabricator.wikimedia.org/project/manage/2874/
[11:13:11] hauskatze: normally for that kind of issues data-services
[11:13:56] jynus: Going back to the user problem, maybe we can try to enable password auth for nagios and see if it indeed recovers, so we can confirm it is trying to authenticate with password (for some weird reason we don't understand yet)
[11:14:08] I am ok with it
[11:14:17] you or me?
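(Annotation on the strace output discussed above.) A rough sketch of how the client packet captured at 11:00:55 could be decoded to see what the check actually sent. The bytes are transcribed from that `sendto` line; the field layout follows the MySQL/MariaDB handshake response format as I understand it, and the function names are made up for illustration. It only shows what the client announced (user, an empty auth response, default database, plugin name); it does not by itself explain why the server answered with "Access denied".

```python
import struct

# Capability flag values from the MySQL/MariaDB client/server protocol.
CLIENT_CONNECT_WITH_DB = 0x00000008
CLIENT_SECURE_CONNECTION = 0x00008000
CLIENT_PLUGIN_AUTH = 0x00080000
CLIENT_PLUGIN_AUTH_LENENC_CLIENT_DATA = 0x00200000


def read_cstring(buf, pos):
    """Read a NUL-terminated string starting at pos; return (text, next_pos)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("ascii"), end + 1


def parse_handshake_response(packet):
    """Decode a client handshake response (hypothetical helper, not pymysql API)."""
    # 4-byte packet header: 3-byte little-endian payload length + sequence id.
    length = int.from_bytes(packet[0:3], "little")
    sequence = packet[3]
    payload = packet[4:4 + length]

    # Capabilities (4 bytes), max packet size (4 bytes), charset (1 byte),
    # followed by 23 reserved filler bytes.
    caps, max_packet, charset = struct.unpack_from("<IIB", payload, 0)
    pos = 9 + 23

    user, pos = read_cstring(payload, pos)

    if caps & (CLIENT_PLUGIN_AUTH_LENENC_CLIENT_DATA | CLIENT_SECURE_CONNECTION):
        # Length-prefixed auth response; lengths below 251 fit in one byte.
        auth_len = payload[pos]
        if auth_len >= 251:
            raise ValueError("multi-byte length prefix not handled in this sketch")
        pos += 1
        auth_response = payload[pos:pos + auth_len]
        pos += auth_len
    else:
        end = payload.index(b"\x00", pos)
        auth_response = payload[pos:end]
        pos = end + 1

    database = None
    if caps & CLIENT_CONNECT_WITH_DB:
        database, pos = read_cstring(payload, pos)

    plugin = None
    if caps & CLIENT_PLUGIN_AUTH:
        plugin, pos = read_cstring(payload, pos)

    return {
        "sequence": sequence,
        "capabilities": caps,
        "charset": charset,
        "user": user,
        "auth_response": auth_response,
        "database": database,
        "plugin": plugin,
    }


# Bytes transcribed from the sendto() line in the strace output.
STRACE_PACKET = (
    b"G\x00\x00\x01"          # header: payload length 71, sequence 1
    b"\x0d\xa2\x2b\x00"       # capability flags
    b"\x01\x00\x00\x00"       # max packet size
    b"\x08"                   # charset id
    + b"\x00" * 23            # reserved filler
    + b"nagios\x00"           # user name
    + b"\x00"                 # auth response length: 0
    + b"zarcillo\x00"         # default database
    + b"mysql_native_password\x00"
)

info = parse_handshake_response(STRACE_PACKET)
print(info)
```

Read this way, the packet names the `mysql_native_password` plugin with a zero-length auth response for user `nagios` on database `zarcillo`, which fits the observation that the client does not appear to be going through socket-based authentication.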
[11:14:23] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui) s5 eqiad progress [x] labsdb1011 [x] labsdb1010 [x] labsdb1009 [x] dbstore1003 [x] dbstore1002 [x] db1124 [x] db1113 [x] db1...
[11:14:31] ty all :)
[11:14:31] jynus: can you do it?
[11:14:35] yes
[11:14:38] thank you
[11:15:43] hauskatze: https://phabricator.wikimedia.org/T217235#4987654
[11:16:13] marostegui: aha
[11:16:24] I was just looking for projectless tasks
[11:16:27] :)
[11:16:53] yeah, the issue was fixed after a few minutes (as soon as the change replicated)
[11:16:57] they were fast catching it!
[11:17:03] which is great :)
[11:18:07] so this is problematic, none of the mariadb commands to handle that work, so I will delete and recreate the user
[11:18:53] probably documentation for mariadb 10.3 that doesn't work on older ones
[11:21:11] so now it works (maybe?), although it errors out
[11:21:41] so it was clearly trying to authenticate via password as the strace was kinda showing?
[11:22:05] not sure yet, I want to see why it errors
[11:25:59] oh, I see, there is a bug in the code
[11:26:07] but I think that is irrelevant
[11:26:45] what does it do?
[11:27:10] so the error is that it expects finished backups to not have a null on the size
[11:27:30] and probably while testing, I got one inserted by error
[11:27:35] I can fix it on the database
[11:27:37] ah, so nothing really to do with the connection weirdness
[11:27:48] but hey, good that a bug was found!
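(Annotation on the bug described above.) A minimal sketch of the kind of guard that avoids this class of failure: a finished backup whose size is NULL is reported as its own case instead of raising while formatting or comparing the size. This is not the code in check_mariadb_backups.py; the row keys `status` and `total_size`, the function names, and the choice of WARNING for a missing size are all assumptions made for illustration.

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_size(size_bytes):
    """Human-readable size; tolerates None (SQL NULL) instead of raising."""
    if size_bytes is None:
        return "unknown size"
    size = float(size_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def evaluate_backup(row, min_size_bytes):
    """Return (exit_code, message) for the latest backup metadata row.

    `row` is a dict as a DictCursor might return it; the keys are hypothetical.
    """
    if row is None:
        return CRITICAL, "no backup found"
    if row.get("status") != "finished":
        return CRITICAL, f"last backup has status {row.get('status')!r}"

    size = row.get("total_size")
    if size is None:
        # Finished, but the size was never recorded: report it, do not crash.
        return WARNING, "backup finished but its size was not recorded"
    if size < min_size_bytes:
        return CRITICAL, f"backup is only {format_size(size)}"
    return OK, f"backup finished, {format_size(size)}"


code, message = evaluate_backup({"status": "finished", "total_size": None}, 1024)
print(code, message)
```

Whether a missing size should count as a warning or a critical is a policy decision for the real check; the point of the sketch is only that the NULL case is handled explicitly.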
[11:28:04] and then later on the code
[11:30:35] all last backups have NULL on size, so last deploy may have done that
[11:41:41] please check https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/493208/1/modules/profile/files/mariadb/check_mariadb_backups.py
[11:41:53] let me see
[11:43:03] that will return a bunch of criticals right now, but we can check that later
[14:01:07] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui) No more slow queries on enwiki apart from db1089 which isn't altered, so I am going to go ahead and alter that one and the master to get enwiki...
[14:21:17] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui)
[14:26:09] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui) s7 comes next as it often shows slow queries there too for arwiki and eswiki
[14:36:53] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui)
[14:51:50] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui)
[15:04:35] I guess the dbstore role is to be removed soon?
[15:28:00] sorry I was in a meeting
[15:28:01] yep!
[15:28:03] Very soon :)
[15:29:46] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui)
[15:39:14] Blocked-on-schema-change, DBA, AbuseFilter, Patch-For-Review: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (Marostegui) s7 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [x] dbstore1002 [] db1125 [x] db1116 [] db1101 [] db1098 [] db1094...
[16:04:51] I am doing a dump of s1, checking the issue got fixed
[17:01:57] jynus: Good evening, are you ready for me to start running the Echo cleanup script?
[17:02:08] (I see you're dumping s1, but this only touches x1)
[17:02:46] yes, no problem
[17:03:01] we had some problem with metadata, but backups are ok
[17:05:17] OK it's running
[17:08:00] I can see the deletes going through: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1069&var-port=9104&from=1551283666536&to=1551287266536
[17:09:50] no errors or lag issues
[17:14:47] 10K rows deleted per second, which is a nice rate, not too high, not too low
[17:26:32] I am going to take a break, please have someone call me if something strange happens
[17:38:54] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (Cmjohnson) a new motherboard arrives tomorrow 28/2/2019 to be replaced.
[17:43:54] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (Marostegui) @Cmjohnson remember this host has MySQL down already, so you can just power it off yourself whenever you are ready for the mainboard replacement.
Thanks
[19:28:41] manuel, for tomorrow: the new backup process fixes the bug (backup_files has a new column), should we redo all backups, run the postprocessing, or just leave it as is, as all backups succeeded anyway (except codfw-s8 which is still ongoing)
[19:55:38] The Echo deletion process is still chugging along nicely, it's about a quarter of the way through enwiki now (15M / 60M deleted) and it's finished everything that comes before enwiki in the alphabet
[19:56:33] please when finished ping me on the ticket to check for potential table defragmentation
[20:14:49] Will do