[08:16:03] 10DBA, 10Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3982320 (10Marostegui) The failover is now done. db2033 is now an x1 slave along with dbstore2001, dbstore2002.
[08:19:12] 10DBA, 10Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3982324 (10Marostegui) 05Open>03Resolved a:03Marostegui
[10:12:08] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#3982668 (10Marostegui)
[10:12:20] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#3982670 (10Marostegui)
[10:12:39] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3982671 (10Marostegui)
[10:50:29] Happy Monday, I'm planning to enable xkill on commons, which might cause the wbc_entity_usage table to grow (or shrink; strangely it's not predictable, but for most wikis it shrinks), but that's needed as it reduces the job queue. If it explodes (which hasn't happened for any wiki, and I made several sanity checks to prevent it from exploding) it will take days to show; I'm monitoring it all the time and will turn it off if it goes out of hand
[10:51:09] just one thing: wbc_entity_usage is a rather small table, how much growth is okay for that table?
[10:52:01] Amir1: Morning, thanks for the update. That table is 14G, it is hard to say what is okay or not okay in terms of space. Obviously, if it grows 100% it is bad :-). But there is not much we can really predict. What would you be monitoring?
[10:54:06] marostegui: the number of rows in that table and the number of rows where eu_aspect = X
[10:54:26] https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage-project?orgId=1
[10:54:45] practically this but live (this graph gets updated once a day)
[10:55:10] as I said, on most wikis it shrinks:
[10:55:10] https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage-project?orgId=1&var-project=arwiki&var-project=cawiki&var-project=cswiki&var-project=dewiki&var-project=elwiki&var-project=eswiki&var-project=fawiki&var-project=hewiki&var-project=huwiki&var-project=iawiki&var-project=jawiki&var-project=kowiki&var-project=ptwiki&var-project=rowiki&var-project=ruwiki&var-project=trwiki&var-project=ukwiki&var-project=viwiki&var-project=wikidatawiki
[10:55:29] nice
[10:55:51] e.g. ruwiki was cut in half. It means they wrote the most lua code possible though
[10:56:04] As I said, even if it goes terribly bad and the table doubles its space, we are talking about 30G, which is bad but it won't bring stuff down.
[10:56:05] *the most horrible
[10:56:21] So we have that safety net there :)
[10:56:47] We'll need to optimize (following your advice on that task)
[10:58:13] (meeting now, sorry!)
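A minimal sketch of the kind of row-count monitoring described above, assuming the standard wbc_entity_usage schema; the replica host is a placeholder and the exact query Amir1 uses may differ:
```
# Hypothetical check against a commonswiki replica (host is a placeholder):
# total rows in wbc_entity_usage plus a breakdown per eu_aspect, to watch for growth.
mysql -h db-replica.example.wmnet commonswiki -e "
  SELECT COUNT(*) AS total_rows FROM wbc_entity_usage;
  SELECT eu_aspect, COUNT(*) AS rows_per_aspect
  FROM wbc_entity_usage
  GROUP BY eu_aspect
  ORDER BY rows_per_aspect DESC;"
```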
[10:58:14] marostegui: that much growth is impossible, the total number of rows is 85M now; if they wrote the best lua code possible (which I highly doubt) it will increase to 110M rows
[10:58:23] not more
[10:58:31] have fun :D
[10:58:36] Amir1: Yeah, I was just thinking about the worst possible scenario from a disk space point of view
[10:59:35] cool
[12:57:06] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3983257 (10Marostegui) a:05Anomie>03Marostegui
[13:43:44] db2030 disk
[13:44:28] I see…
[13:45:55] We'll soon have the task I guess
[14:20:44] I don't think the query killer is working on labsdb1009
[14:20:45] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s3 - https://phabricator.wikimedia.org/T167973#3983573 (10Andrew) (quoting for future reference) ``` Jaime Crespo I would personally export a copy of labswiki, import it on each s5 node disa...
[14:26:04] there are at least two heavy queries still running, with longer query times than the configured threshold
[14:27:12] I think there is a problem with our filtering
[15:24:06] was it set to 10 minutes?
[15:24:45] no, it seems not to kill all queries
[15:25:01] I know why, I cannot see the right syntax
[15:25:07] do you have 5 minutes to try it?
[15:25:37] yep
[15:25:43] the problem is the --match-command Query
[15:25:58] it doesn't match queries in "Execute" state
[15:26:05] it would be nice to know that is Execute
[15:26:09] *what
[15:26:42] I have no idea, are those parameter-queries?
[15:26:56] and make the query killer kill both
[15:27:19] Why don't we use --match-command Execute instead?
[15:27:20] when I changed it to '(Query|Execute)', I think I killed more than I wanted
[15:27:25] Ah right
[15:27:29] you tried it already
[15:27:31] so basically
[15:27:49] that is what I wanted, but I cannot make it work, but I didn't spend more than 20 seconds on it
[15:28:11] if you have 5 minutes to look at the man page or whatever
[15:28:22] yeah, I'll take a look now
[15:28:25] maybe testing on a non-production host or something
[15:28:51] FYI, I set up labsdb1010 to receive 50% of the load
[15:28:57] I am puppetizing it now
[15:29:04] Ok, we still have to rebuild it
[15:29:46] I know, this was a test
[15:29:52] to have it ready by then
[15:29:58] btw, will db2030 be decommissioned? do you know off the top of your head if it is included, or is it only servers < db2030? I always forget
[15:30:19] I am asking to see if it is worth replacing the broken disk or just getting a large server and placing it instead of db2030
[15:30:23] and throwing it away
[15:30:24] I was asking myself the same question, needs capex review
[15:30:29] haha
[15:30:30] I think
[15:30:32] I will check
[15:30:34] it wasn't planned
[15:30:42] but according to capex it should
[15:30:46] or something like that
[15:30:50] let me check now
[15:32:05] I think it goes away because it says db2016-db2030 and then db2031 also needs to go away
[15:32:08] so db2030 has to go away
[15:32:34] but I am not sure if we said we were going to do it on the goal
[15:32:41] which is not that important
[15:32:54] no, it is mostly: should we change the disk now?
[15:32:59] I think we should, really
[15:33:04] change it?
[15:33:13] what was the one with the broken BBU?
[15:33:14] the broken raid one
[15:33:27] no, db2030 -> degraded raid, db2033 -> broken BBU
[15:33:30] mmmm
[15:33:44] so are you suggesting changing it or removing it?
[15:33:53] removing the host?
[15:34:06] and more important, what is it hosting now?
[15:34:13] m5
[15:34:18] only codfw slave for m5
[15:34:26] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3983773 (10Volans)
[15:34:51] we should have enough with the spares of codfw
[15:34:55] So I think as we are not going to replace that host _right now_, we could just replace the disk. To avoid the risk of another disk failing and losing the data
[15:34:59] We do have spares yep
[15:35:32] replace those 2 with db2037 and db2044
[15:35:41] "we should..."
[15:35:48] then forget
[15:36:09] For some reason I thought m5 was bigger, it is tiny
[15:36:14] yes
[15:36:15] I can probably try to do it this week
[15:36:22] pfff
[15:36:28] And decommission it
[15:36:38] I do not think it is that important
[15:36:49] I was talking about the decision, not the implementation
[15:37:00] So then we should replace the disk, if another one fails we can lose all the data
[15:37:10] That was my main point, replace the disk and do the replacement with no rush
[15:37:30] if there is one spare onsite, sure
[15:37:38] I will ask papaul
[15:38:06] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3983793 (10Marostegui) p:05Triage>03Normal a:03Papaul @Papaul if you got spares, could you replace it? Thanks!
[15:39:07] hola marostegui & jynus
[15:41:53] Hola Hauskatze
[15:42:06] I wonder why the kill isn't working on labs, I just did a test on a non-core host and it worked fine
[15:42:16] could it be the user regex maybe?
[15:42:19] let's see
[15:42:39] marostegui: the kill works, but only for queries
[15:42:45] not for non-queries
[15:43:05] probably they are prepared statements or something
[15:43:43] yep
[15:45:16] look at the labsdb1009 processlist
[15:45:33] most of the busy processes are Query
[15:45:39] but there is at least one Execute
[15:45:52] | 3708 | Copying to tmp table
[15:45:56] So, looks like --match-command 'Query|Execute'
[15:46:06] that is what I tried and failed
[15:46:08] I just tried it and that query was killed (only that one)
[15:46:53] and then we have to reinit all queries like that on the other hosts
[15:47:10] paste the command somewhere to puppetize it when there is time
[15:47:32] Let's leave the 'Query|Execute' for a few days on 1009 only
[15:47:36] To see if the problem stops
[15:47:44] ok, but please copy it somewhere
[15:47:47] yeah
[15:47:50] I think there was a ticket open about it
[15:47:52] going to do so on the task
[15:47:52] yeah
[15:47:52] or we will forget
[15:47:57] no worries I will comment there
[15:47:58] thanks
[15:48:15] I asked not to do the whole thing, just because I was unable to do it
[15:48:25] and needed to move on
[15:48:29] Of course!
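A sketch of the pt-kill invocation being left running on labsdb1009 at this point, assuming only the flags mentioned in the conversation; the busy-time threshold, interval and user pattern are placeholders, and (as it turns out below) the shipped pt-kill does not apply --busy-time to Execute commands:
```
# Hypothetical pt-kill run matching both plain queries and prepared-statement
# executions; thresholds and the user pattern are placeholders, not the
# puppetized values.
pt-kill \
  --print --kill \
  --match-command '(Query|Execute)' \
  --match-user '^u[0-9]+' \
  --busy-time 600 \
  --interval 30 \
  h=localhost
```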
[15:48:36] It might not work, we'll see :)
[15:49:00] my regex actually worked
[15:49:06] but it killed too many queries
[15:49:18] we'll see how it works now
[15:49:31] leave it as it is, logging what it kills
[15:49:38] to review it later
[15:49:46] this and the proxy changes
[15:49:55] were because labsdb1009 was getting overloaded
[15:50:29] I created https://gerrit.wikimedia.org/r/#/c/412721/
[15:50:36] but the syntax is wrong
[15:50:57] if you have a better idea about the right yaml/each function, please help
[15:55:46] It is now killing wrong connections indeed
[15:55:50] KILL 110040125 (Execute 4 sec)
[15:57:24] let me try some other approaches
[15:59:19] aha, so it wasn't just me :-)
[16:00:11] Yeah, Execute basically kills any query being executed, looks like it doesn't parse the --busy-time
[16:00:25] --busy-time
[16:00:25] type: time; group: Query Matches
[16:00:25] Match queries that have been running for longer than this time.
[16:00:28] The queries must be in Command=Query status. This matches a
[16:00:31] query's Time value as reported by SHOW PROCESSLIST.
[16:00:33] :)
[16:05:38] I think we know why it doesn't work: https://jira.percona.com/browse/PT-167
[16:05:58] So that matches what I saw, that --busy-time was being ignored
[16:09:08] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3983838 (10Marostegui) We are seeing that some queries in `Execute` state are not being killed. Playing around with the current way of running pt-kill: ``` pt-kill --print --kill --victi...
[16:10:38] marostegui: so you found the issue
[16:10:49] that is a good step, now we have to find a workaround
[16:12:19] we could fork pt-kill, the same way we had to fork pt-table-checksum and pt-online-schema-change :-)
[16:12:25] but I would prefer not to
[16:12:32] Or install the new version XD
[16:12:41] oh, it is fixed?
[16:12:50] I am going to try it on my "lab"
[16:13:04] As per the jira ticket it is fixed in 3.0.5
[16:13:05] which version do we have on stretch?
[16:13:13] 2.2
[16:13:18] ugh
[16:13:32] we can apply the patch to an older version, too
[16:13:48] or package the latest version everywhere
[16:14:18] for now, download the latest version somewhere and just run it?
[16:14:24] just the single script
[16:14:28] yeah, that is what I am doing :)
[16:14:32] want to confirm it works
[16:14:35] we can add it to our puppet, it is a single file
[16:14:37] sure
[16:16:17] I was thinking of deploying https://gerrit.wikimedia.org/r/#/c/412721/
[16:16:40] worst case scenario, puppet breaks, but it doesn't affect the proxy state
[16:30:20] So it still doesn't work on 3.0.6
[16:30:44] lol
[16:31:27] did you test manually applying the patch submitted by the user?
[16:31:40] No, I tested the new version released by percona XD
[16:31:43] pt tools are easy to hack
[16:31:49] they are just horrible perl scripts
[16:31:51] yeah, I might try the patch
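A sketch of how one could verify from the processlist whether long-running Execute commands survive the killer (the issue discussed above); the 600-second threshold is a placeholder, not the configured value:
```
# Hypothetical check: list sessions in Query or Execute command state that have
# been running longer than 600 seconds, i.e. candidates the killer should catch.
mysql -e "
  SELECT ID, USER, COMMAND, TIME, STATE, LEFT(INFO, 80) AS query_snippet
  FROM information_schema.PROCESSLIST
  WHERE COMMAND IN ('Query', 'Execute')
    AND TIME > 600
  ORDER BY TIME DESC;"
```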
[16:54:43] I DID [16:54:46] sorry, my cat [16:54:47] XDD [16:54:58] she just pressed the caps key [16:55:11] please share the link/add upstream tag on issue [16:55:16] I would like to add pressure too [16:55:26] I think I did? [16:55:34] https://phabricator.wikimedia.org/T183983#3983838 [16:55:37] https://jira.percona.com/browse/PT-167 [16:55:40] is the old one [16:56:40] did you open a separate one or something? [16:56:51] No, I commented on the PT-167 [16:57:58] oh, I cannot see it [16:58:13] It says "There are no comments yet on this issue. " to me [16:58:14] You cannot [16:58:18] Because i didn't press enter [16:58:18] XD [16:58:21] :-) [16:58:23] Just did [16:58:44] I don't have grants to reopen the task [16:58:47] which is what I wanted [17:01:53] So that workaround is really looking pretty good :), I can see executes being run without being killed :) [17:01:58] so let's leave it for a few days [17:02:30] the load balancing seems puppetized, more or less working [17:02:38] nice! [17:02:44] I do not want to restart the proxy without failing itself [17:02:49] but the templates look good [17:02:55] on the server [17:03:16] I am going to call it a day [17:04:26] güd [17:04:42] gud bay [17:47:43] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3983957 (10zeljkofilipin) @EddieGP I would also recommend asking for a... [19:15:33] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3984136 (10Marostegui) This host has completely failed - looks like (as per my chat with @Papaul that there were two disks blinking badly on the server chassis), the one replaced was not yet detected by mega... [19:30:21] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3984170 (10Marostegui) Also we could remove 2x160GB from s6 with a large server and use one of those 160GB ones for this replacement [21:28:08] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3984342 (10EddieGP) >>! In T176754#3983957, @zeljkofilipin wrote: > @Ed...