[04:13:40] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859216 (10fgiunchedi) [07:31:13] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859340 (10Marostegui) Thanks guys for taking care of this. A quick HW check reveals no issue with db1028, just to discard issues.... [07:39:27] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859343 (10Peachey88) [07:39:46] 10DBA: dbstore1001 duplicate entry on s4 commonswiki.recentchanges - https://phabricator.wikimedia.org/T152766#2859344 (10Marostegui) [07:46:22] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2859374 (10Marostegui) The server survived to the last alter. I have started the netcat transfer that always makes the server die. [07:51:08] 10DBA, 10Phabricator, 07Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2859379 (10mmodell) Ok I've reverted the edits using the script provided by epriestley. @daniel, does everything look ok? [07:53:47] 10DBA: dbstore1001 duplicate entry on s4 commonswiki.recentchanges - https://phabricator.wikimedia.org/T152766#2859380 (10Marostegui) Looks like one of them was skipped and it broke again: ``` 161208 12:52:55 [ERROR] Master 's4': Slave SQL: Error 'Duplicate entry '436249985' for key 'PRIMARY'' on query 161208 12... 
[08:39:38] 10DBA, 13Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2859395 (10Marostegui) Alter running on db1082 [08:41:56] 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2859398 (10Marostegui) And it died after 50 minutes into the transfer. @Papaul can we get the RAID controller replaced by HP? it is one of the key components we have not replaced yet. [09:43:12] 10DBA, 13Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2859446 (10Marostegui) db1082 is finished ``` root@neodymium:~# mysql -hdb1082 -A wikidatawiki -e "show create table revision\G" *************************** 1. row ***************************... [09:45:32] I cannot vote on gerrit [09:45:43] are you logged in? [09:45:53] they updated yesterday all sessions were reset I guess [09:46:10] and login with ucfirst didn't work for me, I had to login lowercase (username) [09:46:42] that worked [09:46:44] thanks [09:46:48] yw :) [09:47:20] jynus: oh you are already in, let's see what we need to do with dbstore1001 :( [09:55:32] so I was thinking about it and probably I created the issue during the restart [09:55:41] even if I did STOP SLAVE [09:55:52] mmm, why? [09:55:54] between that and the shutdown [09:55:58] non flushed changes? 
[09:56:13] some threads could have done start slave and then stopped uncleanly [09:56:28] that would match having extra rows [09:56:53] something that innodb + GTID would not allow [09:57:07] because things would be reverted properly [09:57:51] we have to a) get rid of events [09:58:01] b) get rid of tokudb [09:58:32] c) implement gtid/table based replication [09:58:59] d) implement backups on dbstore2001 [09:59:08] a) and b) +10000 [09:59:26] I was actually thinking that next week I want to keep importing the remaining shards into dbstore2001 [09:59:34] now, this has no easy fix [09:59:35] and have all of them ready by the end of the week [09:59:43] because we can certainly delete the extra rows [09:59:56] but we do not know how many there are [10:00:03] in any case, we need to do something [10:00:17] because we are running out of relay space (see disk space warning) [10:00:51] oh yes [10:01:11] yes, we have no idea how many they are [10:01:28] I was checking the user, and it is a wmf employee, so maybe it was even a script [10:01:34] not only that [10:01:55] are you referring to my email? [10:02:14] no [10:02:24] to the user that was actually changing stuff in the db [10:02:49] ? [10:03:05] sorry, I do not follow you [10:03:12] the two duplicate entries are rows coming from a user who is a wmf employee [10:03:13] you mean like the db user? [10:03:17] according to the userid and all that [10:03:33] define coming? inserted?
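A note on item c) above: on a MariaDB multi-source replica such as dbstore1001, switching a connection to GTID-based positioning is roughly the following (a sketch only; the 's4' connection name comes from the log, and the exact sequence would need testing first):

```sql
-- Per named connection on the replica. With master_use_gtid = slave_pos,
-- replication resumes from the last transaction the SQL thread actually
-- committed, so an unclean stop cannot silently re-apply rows the way
-- binlog file/position coordinates can.
STOP SLAVE 's4';
CHANGE MASTER 's4' TO master_use_gtid = slave_pos;
START SLAVE 's4';
```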
[10:03:44] or on-wiki user [10:03:45] yeah, the actual wikicommons changes [10:03:48] wiki-user [10:03:55] sorry I am not explaining myself too well this morning [10:04:03] ah, no, I believe that is irrelevant [10:04:04] it is actually not relevant to the discussion [10:04:10] exactly [10:04:15] that could be mere chance [10:04:24] I think it fits the restart theory [10:04:33] because it happened 24 hours after my restarts [10:05:01] and fits the same profile of what happened when a server crashed [10:05:09] before gtid [10:05:34] I actually thought about it [10:05:49] but remember I had issues cancelling the events last time? [10:05:52] true [10:06:18] maybe the events are terrible queued up [10:06:26] *terribly [10:06:37] so chances of this happening to dbstore1002…we do not know [10:06:41] it could happen or not [10:06:48] it depends [10:07:09] or you mean, that actually happened? [10:07:20] no, it hasn't happened [10:07:27] probably none, according to my theory [10:07:36] because if I run stop all slaves there [10:07:39] it actually stops them [10:07:48] ah yes, no events [10:08:54] in theory it should be just a few rows, so either of us should be checking each row individually and deleting if duplicate [10:09:05] we must also check the previous log events [10:09:12] because if transactions have been skipped [10:09:23] that could leave the db in an inconsistent state [10:09:36] that is my fear :( [10:09:43] that we might have more issues now [10:09:47] because of that skip [10:09:50] so, we were going to reconstruct [10:09:57] the slave anyway [10:10:12] because of the new disks + toku [10:10:16] yes [10:10:38] we should ask about those disks, in the meanwhile [10:10:47] I would set up dbstore2001 [10:10:54] as you said [10:11:02] I can have that done next week [10:11:04] and start taking dumps of commonswiki [10:11:06] it should take 1 day per shard [10:11:11] I think that is already loaded?
[10:11:17] yes [10:11:18] there instead of dbstore [10:11:19] s4 is there [10:11:36] that should be a few lines of puppet [10:11:50] and would be a good test for when we move all dumps there [10:11:56] while the disks are set up [10:12:19] actually [10:12:38] we could set up dumps to run there already of all existing dbs [10:12:38] Just checked: s1,s3,s4,s5 are on dbstore2001 [10:12:51] just not set to bacula [10:12:54] *sent [10:13:02] to avoid duplicate storage [10:13:36] I think the numbers above, with an automatic pt-table-checksum [10:13:48] are good goals for next quarter, aren't they? [10:13:54] indeed [10:13:55] (internal goals) [10:13:58] yes [10:14:03] That is a good one [10:14:15] Improving backups is always a good goal I think :) [10:14:16] now, what do you think we should do now, knowing that [10:14:42] #1 nag robh/chris about the disks [10:14:43] should: move s4 backups to dbstore2001 and skip the transactions one by one and see what we get? [10:15:01] let's not skip them [10:15:07] let's delete the rows [10:15:17] when we skip transactions, we can skip more than a single insert [10:16:42] also, as I said in my email, on a gtid slave (not the case) that could mess up the gtid position [10:16:59] https://phabricator.wikimedia.org/T143874#2859495 -> asked rob [10:17:21] thanks [10:17:27] I can take care of the slave [10:17:35] so the rows that are complaining at the moment are: rc_id [10:17:35] 436249985 (which was skipped) and 436249921 [10:17:37] can you take a look at the new grants [10:17:47] needed by labs [10:17:58] and apply those to all 5 slaves we have? [10:18:15] sure thing [10:18:21] I am trying to find the ticket for that [10:18:22] there is a ticket [10:18:27] yeah, same problem here [10:20:38] https://phabricator.wikimedia.org/T149933 [10:20:39] ?
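The "delete the rows rather than skip the transactions" approach above would look something like this (a sketch; the table and rc_id values are taken from the duplicate-entry errors in the log, everything else is an assumption, and in practice each row would be checked individually first):

```sql
-- On dbstore1001 only: remove the locally-diverged rows so the
-- replicated INSERTs can apply cleanly. This is safer than
-- SQL_SLAVE_SKIP_COUNTER, which discards a whole event/transaction
-- and can therefore skip more than the single conflicting insert.
STOP SLAVE 's4';
SET SESSION sql_log_bin = 0;  -- keep the fix out of the local binlog
DELETE FROM commonswiki.recentchanges WHERE rc_id IN (436249985, 436249921);
SET SESSION sql_log_bin = 1;
START SLAVE 's4';
```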
[10:21:13] nope [10:21:24] something about meta script or something [10:22:18] https://phabricator.wikimedia.org/T151570 [10:22:22] found it [10:22:57] he only pinged me, that is why it was difficult to find, and it was buried under a long ticket [10:23:33] and the title is also not helping haha [10:26:36] 10DBA, 06Labs: Querying the logging table on labs is slow - https://phabricator.wikimedia.org/T131266#2859500 (10jcrespo) > Is there a way to change the behavior of the query planner to use the correct index (in this case, page_time) when running select statements on the logging_userindex table/view? There wo... [10:45:42] marostegui, ask me for context [10:46:09] meta_p is a database that only exists on labs, and chase needs to generate it [10:46:22] he cannot do it if he doesn't have the grants [10:46:29] jynus: didn't want to disturb you :) [10:46:48] how can you disturb me? [10:47:06] if someone else can provide context, that offloads you a bit :) [10:47:07] so [10:47:14] he needs all privileges then [10:49:59] 10DBA, 10Phabricator, 07Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2859517 (10daniel) 05Open>03Resolved @mmodell yes, looks great, thank you! [11:26:20] jynus feel free to treat my questions as background priority ;) am not actually blocked on anything, only curious about 'the right way' of doing things [11:26:20] which we can always change [11:27:37] YuviPanda, for me labsdb is a priority this quarter [11:27:45] it will avoid so much future pain! [11:28:09] yeah [11:28:47] I am 99% on the way to accurately populating the meta db for current accts [11:28:55] next step would be to replicate these to the new labsdb :) I'll ping people before I do that [11:30:32] the accounts are more important, though [11:31:08] replicate the accounts?
that's what I meant, sorry [11:31:42] oh, sorry [11:31:56] I got confused between [11:31:58] meta db [11:32:02] and the meta_p db [11:32:11] (not to mention the metawiki db) [11:32:22] not confusing or anything at all [11:32:24] :-) [11:34:22] haha :D [11:34:41] metawiki_, meta_p, metadb [11:35:08] next can be: wikimeta [11:36:10] also metawiki_p [11:36:13] wikidatameta, wikimetadata [11:36:18] and s56789_metawiki [11:36:31] ^that is invented, but it could exist [11:39:11] 10DBA, 13Patch-For-Review: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644#2859557 (10Marostegui) db1087 finished ``` root@neodymium:~# mysql -hdb1087 -A wikidatawiki -e "show create table revision\G" *************************** 1. row ***************************... [11:46:23] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2859564 (10yuvipanda) As part of this, I've fixed all tools that didn'... [11:57:35] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2859571 (10jcrespo) @kaldari I do not see long-running script being referenced on https://wikitech.wikimedia.org/wiki/Deployments#... [12:00:19] 10DBA, 10Phabricator, 07Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2859598 (10epriestley) Thanks! [13:11:25] marostegui: mornin :) so re: https://phabricator.wikimedia.org/T151570#2859508 it's only on 1001/1003 as it is entirely managed by this script https://gerrit.wikimedia.org/r/#/c/325949/ [13:11:29] which seems to have not been run a long time [13:11:42] like since whenever VE was not the default long [13:11:52] chasemp: morning! 
[13:11:55] and that creates the db and tables etc entirely [13:12:00] riiight [13:12:05] do you need ALL priv? [13:12:05] and it was always run as root I imagine [13:12:08] please [13:12:17] ok, give me a sec, where can we test it? [13:12:27] i mean, on which server do you want it first? [13:13:38] 1001/1003 are the ones I'm looking at now, but 1009/10/11 are not far behind and I can start there so kind of your choice :) maybe 1009 is best [13:13:44] won't affect any of the existing data [13:13:53] and the perms change should be the same across afaiu [13:13:59] yes, totally [13:14:19] I meant create them somewhere, test to make sure it works or whether further tweaks are needed, and then deploy across all of them :) [13:14:40] gotcha, 1009 then? DBAs choice :) [13:14:46] sure [13:14:48] give me a minute [13:16:05] done [13:16:41] as an aside marostegui I have gotten feedback from a few bot (esp anti vandal) maintainers that they look at meta_p to get an authoritative view for db's on the labsdb's, I had wondered if this was even actively used being so stale [13:16:47] thank you for taking care of that, marostegui [13:16:48] ok I'll give it a whirl and report back on 1009 [13:17:03] jynus: my pleasure! you are overloaded already! [13:17:05] chasemp, let me give you a link [13:17:13] note my smile :-) [13:17:44] chasemp ah thanks for giving me some history!
Appreciate it [13:17:45] chasemp, https://phabricator.wikimedia.org/T142759#2823498 [13:18:14] and my proposal to move such functionality to mediawiki https://phabricator.wikimedia.org/T142759#2824239 [13:18:38] "there are microservices here and there that are basically unmaintained, probably undocumented and not puppetized/version controlled" [13:18:59] that is something that should be done in mediawiki, not maintained by us [13:19:44] interesting, onboard for that proposal totally but here https://gerrit.wikimedia.org/r/#/c/325949/ it's at least functioning as expected for the moment I believe [13:19:52] having said that, we probably have to keep it around just by searching the meta_p entries on phabricator [13:19:52] I would be happy if that's the last time I have to touch it for sure [13:19:59] yes [13:20:42] oh, I support doing that [13:20:46] I don't entirely understand why it's doing what it's doing how it's doing it, as in, even if this was the plan (some meta table as an overview of available dbs) [13:21:13] why do it this way? I would probably have just translated the output of the api query in its entirety into a table on labsdb rather than the current [13:21:36] but anyhoo, I'll try to keep track of that thread for when we can kill it off [13:21:50] chasemp: I remember some amount of 'because toolserver had such a table' [13:22:00] that should be an extension, e.g.
on meta [13:22:22] creating a table and labs users can either query the production api or read the production table copy [13:22:40] the issue is we all are too saturated to do that :-) [13:22:59] tread water it is then my friends :D [13:23:12] but maintaining bad ideas is something we should not encourage [13:23:18] :-/ [13:23:37] anyway, I am complaining to the only people that probably agree with me [13:23:42] sorry about that [13:24:11] no worries, a bit of preaching to the choir gets the congregation going :) [13:24:27] also thank you for taking care of it [13:27:09] marostegui, the regular expression on the check_mariadb check has a bug [13:27:40] what? [13:27:49] icinga check [13:28:13] the service one? [13:28:16] It says something like "Error 'Duplicate entry '436249921' for key 'PRIMARY'' on query. Default database: 'commonswiki'. Query: [snipped]2 [13:29:00] I hadn't accounted for the fact that the query can be cut by the server if it is too long [13:29:15] ah yes, I saw that and I thought it was intended [13:29:24] so it needs to delete from Query: till the end [13:29:32] not until the last ' [13:30:00] (https://gerrit.wikimedia.org/r/#/c/326113/ will let me differentiate between 'old' labsdbs (with no ROLE setup) and new ones (with ROLE) setup) [13:30:07] I'll soon have user accounts on the new labsdbs! [13:30:08] if you finish the labs tasks, go have a look at the check [13:30:09] * YuviPanda excited [13:30:14] marostegui: seems to have worked out :) `root@labsdb1009:~# python maintain-meta_p.py --databases fiwikivoyage --debug` and then `MariaDB MARIADB labsdb1009 meta_p > select * from wiki \G` [13:30:16] jynus: sure [13:30:28] can you translate that to the other labsdb's sometime today possibly? [13:30:36] chasemp: will do right now [13:31:51] 'legeacy' probably a typo, YuviPanda ? [13:32:16] it is [13:33:49] jynus: thanks for spotting [13:34:03] chasemp: done in 1001,1003,1010 and 1011 [13:34:29] marostegui: great, I'll run through and report back.
Gotta take the kid to school so will be a bit, thanks [13:37:33] see my heads up for analytics mysqls [13:41:17] chasemp, YuviPanda, DBA, given that we are all here... when you have a sec could you check also T152767 please? I added a quick fix this morning but it will be nice to clean it up a bit more [13:41:18] T152767: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767 [13:42:07] volans: not to play b'cratic ping-pong, but andrewbogott is the person who usually works with that, and he was doing something to it yesterday as well [13:42:26] ok, then I'll ping him :) [13:42:29] no problem [13:45:52] volans, there was some playing around 2 days ago [13:45:54] with that key [13:46:19] look at the logs (not on -operations) for wednesday [13:46:41] yes I knew that and I saw you and joe fixing it for prod [13:47:25] given it was a labs db-related key and you were all here, I thought it might be a good time to bother all of you with one stone :-P [13:47:45] ˜/jynus 14:29> so it needs to delete from Query: till the end -> that is what is being done right now (apart from the "2"): or key 'PRIMARY'' on query. Default database: 'commonswiki'. Query: [snipped]2 [13:47:49] or you meant the first "query" [13:49:12] no, it needs to delete everything, including the 2 [13:49:34] so instead of waiting for a "'", til $ [13:49:38] or something like that [13:49:50] (we have to try on dbstore1001 first [13:50:02] yeah yeah, just getting familiar with the script :) [13:51:37] that was actually my only reason to ask you about that [13:52:02] haha good good [13:52:58] that and because I am doing the analytics restarts [14:33:47] eventlog: Inserted about 2000 rows [14:33:53] sorry, wrong channel [15:23:24] 10DBA: Change dbstore1001 delayed slave to be a direct slave of the eqiad masters - https://phabricator.wikimedia.org/T133386#2859940 (10jcrespo) 05Open>03Resolved a:03jcrespo Just did this a few days ago.
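The fix being agreed on above — strip from `Query:` to end of line (`$`) instead of stopping at the last `'`, so truncated messages keep no trailing residue like the stray `2` — could look like this in Python (a sketch, not the actual check_mariadb code):

```python
import re

def snip_query(message: str) -> str:
    """Redact everything after 'Query: ', even when the server has
    truncated the message and the closing quote never arrives."""
    return re.sub(r"Query: .*$", "Query: [snipped]", message)
```

Applied to the truncated icinga message quoted above, the stray trailing `2` disappears along with the query text.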
[16:57:52] jynus: chasemp am going to run code that creates users in labsdb1009/10/11 shortly. [16:58:00] (can also remove them if we fuck up) [16:58:14] keeps state in the backing db, and is idempotent [16:58:15] just a fyi, will poke when done [16:58:24] YuviPanda: good luck :) [16:59:43] YuviPanda, let me create a backup of the grants [16:59:50] very quickly [16:59:56] so we can revert them easier [17:00:25] (not that I trust you, but I want a place where non-user accounts are recorded) [17:00:56] jynus: sure! that sounds great [17:24:21] YuviPanda, done, sorry [17:24:48] I've copied the original grants to neodymium [17:25:00] and I may puppetize that somewhere [17:25:48] jynus: cool! [17:25:56] I'm going to eat some dinner and then play with it [17:26:19] I will be around if you need additional grants [17:51:48] 10DBA, 13Patch-For-Review: dbstore1001 duplicate entry on s4 commonswiki.recentchanges - https://phabricator.wikimedia.org/T152766#2860145 (10jcrespo) p:05Triage>03High a:03jcrespo I dropped some rows, still, there are some rows that are only on dbstore, but not on s4-master. It was 11 at first, then at... [17:54:49] 10DBA, 10Edit-Review-Improvements-RC-Page, 06Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016), 13Patch-For-Review: Implement functionality for RC page 'Experience level' filters - https://phabricator.wikimedia.org/T149637#2758602 (10tarlocesilion) Please rethink names used in this filter. "Newcomer"... [19:10:49] ah! [19:10:56] my null byte at the end of the password field [19:10:57] was because [19:11:01] I was an idiot! 
[19:11:07] the mysql password hashing mechanism [19:11:08] is [19:11:16] * + sha1(sha1(password)) [19:11:17] and not [19:11:25] sha1(sha1('*' + password)) [19:11:26] lol [19:11:27] ol [19:11:27] o [19:11:33] k [19:24:24] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860428 (10kaldari) @jcrespo: Thanks for reminding me to list long-running script runs on the calendar. I had completely forgotten... [19:32:37] 10DBA, 06Operations, 10ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860443 (10Cmjohnson) The thermal paste on cpu1 was nearly non-existent. Cleaned both CPU's and re-applied paste. After booting the server, the disk in slot 2 failed. A ticket has been created w... [20:03:57] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2860513 (10yuvipanda) When I finally got to actually creating the user... [20:04:14] jynus: marostegui if anyone is still around, am blocked on https://phabricator.wikimedia.org/T149933#2860513 now [20:04:25] np if not, I can check back on monday [21:56:06] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2860727 (10chasemp) A small note owed as part of a previous conversati... [23:02:07] 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2860880 (10jcrespo) I have created the role: `CREATE ROLE labsdbuser;... 
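The realization above is MySQL's mysql_native_password scheme: the stored hash is a literal `*` prepended to the hex of SHA1 applied to the *binary* SHA1 of the password — not SHA1 of (`*` + password). A quick sketch to verify:

```python
import hashlib

def mysql_native_password(password: str) -> str:
    """Reproduce MySQL's PASSWORD() output:
    '*' + HEX(SHA1(SHA1(password))), with the inner digest kept binary."""
    inner = hashlib.sha1(password.encode("utf-8")).digest()
    return "*" + hashlib.sha1(inner).hexdigest().upper()
```

For example, `mysql_native_password("password")` yields the well-known `*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19`, matching `SELECT PASSWORD('password')` on a pre-8.0 server.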
[23:07:25] 10DBA, 10MediaWiki-Database, 06Operations: db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php - https://phabricator.wikimedia.org/T152761#2860886 (10jcrespo) @kaldari, this doesn't have to be synchronous. Please schedule a time with some advance notice on the Deployme... [23:11:43] 10DBA, 06Operations, 10ops-eqiad: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2860890 (10jcrespo) Thank you very much. I will reset the RAID when the new disk gets installed (if I can handle the bios interface). A new disk failing would explain the previous RAID I/O error.