[04:57:50] banyek|away: db1124 is sanitarium, so not nice to have it broken as it affects labs, but not super urgent. It was expected it to break during the fixes [05:17:46] Yes I just thought it at the end but the alert was almost new so I was like better notice needless, than not [05:17:53] Just woke up [05:25:22] the backups failed for s8, it was a privileges issue [05:25:24] I am fixing it [05:38:42] marostegui did you create the right user? [05:39:20] jynus: yep [05:39:30] it is now flowing [05:40:07] into /srv/tmp.s8, once that is done, I will move the latest one to archive, and this one to latest [05:40:10] does that sound ok? [05:40:48] I am not 100% sure if that is the correct procedure once a manual backup is taken [05:40:57] you are generating it as root [05:41:02] you will need to chown it [05:41:07] Yeah, I was planning to do that [05:41:15] (as I realised I did it as root instead of dump) [05:43:51] the alerts work :-) [05:43:54] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from 17th at 19:04 from Jaime: all tables except wb_t... [05:44:10] jynus: indeed :) [05:45:20] where did you found the logs? [05:45:49] On /srv/backups/ongoing [05:45:52] log.s8 [05:45:54] no [05:46:00] the db1071 logs [05:46:56] Ah [05:47:07] On /srv/sqldata [05:47:20] because clearly those tell the story [05:47:32] I have not analyzed them fully yet [05:47:37] What's the story you see? [05:47:42] STOP; enable getid; START [05:47:55] yes, but the position is the same [05:47:58] (and start replicating on the current position) [05:48:03] no, it is not, afaiks [05:48:25] it is [05:48:36] master_log_pos='1036765620'. that is what GTID starts to [05:48:39] and this is the previous [05:48:46] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:49:58] 'CHANGE MASTER TO executed'. Previous state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005880', master_log_pos='216202297' [05:50:15] New state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005879',master_log_pos='1036765620' [05:51:05] Previous Using_Gtid=No. New Using_Gtid=Slave_Pos [05:52:08] at what time? [05:52:16] 10:27:54 [05:52:33] you wrote that! [05:52:35] but look a line after that [05:52:40] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [05:53:04] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:53:09] I don't think that is a coincidence [05:53:16] yeah, but it is weird [05:53:16] I bet that is the missing gap [05:53:21] because if you see the secuence: [05:53:32] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:53:35] 180913 10:27:53 [Note] Slave I/O thread exiting, read up to log 'db2045-bin.005880', position 216202297 [05:53:38] those are the SQL and IO threads [05:53:40] I am quite sure 78 -> 79 is the missing gap [05:53:40] and then [05:53:42] 180913 10:27:54 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005880', master_log_pos='216202297'. 
New state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005879', master_log_pos='1036765620'. [05:53:48] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [05:54:01] so the SQL thread started on the right position [05:54:19] 79 is already the jump [05:54:41] so the IO thread is the one that jumped but SQL thread connected finely [05:54:48] sql may be right, but io may juped [05:54:52] it doesn't matter [05:54:58] position was wrong [05:55:35] and I guessing you didn't indicate a position manually, so gtid autopositioned on the wrong pos [05:55:52] yeah, I always use: STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE ; [05:55:53] based on the local state/time [05:56:01] which is scary [05:56:13] but 10:27 is not when the thing happened [05:56:15] becuase it means that if we stop on the heartbeat [05:56:50] I think it is [05:57:02] as you said, 11:13: We believe the schema change finished [05:57:04] wasn't it 09:08 till 09:58 the gap? [05:57:18] alter finished on 58 [05:57:24] and then it started replicationg [05:57:29] or it started manually [05:57:47] or it got repositioned [05:57:50] etc. [05:58:08] but according to that at 09:08 GTID wasn't enabled [05:58:11] I belive gtid + ongoing local changes is the source of this [05:58:14] it doesn't matter [05:58:16] yeah, me too [05:58:18] it was stopped [05:58:29] ah I get you [05:58:32] and it didn't start at the right position [05:58:54] so it doesn't matter nothing "broke" at 9:08 [05:59:02] but because repl was stopped [05:59:22] it is funny because you wrote the report [05:59:38] and I was like, manuel actually found the cause and didn't say anyting! [05:59:59] haha [06:00:05] when it is the only log you pasted? [06:00:06] I checked only SQL thread [06:00:36] yeah, but if sql was stpped [06:00:41] io will tell the story [06:02:50] I wonder if it is easy to reproduce [06:02:54] because we do this all the time [06:02:59] ALthough I am not sure if we do with lag [06:07:09] we don't do often local changes [06:07:21] with binlog enabled [06:11:24] and with gtid stopped [06:11:26] that is true [06:11:46] so we normally don't do: out of band changes with gtid stopped, and then enable it [06:12:02] we do lots of out of band but with gtid enabled [06:33:57] b*anyek: https://phabricator.wikimedia.org/T207273 we probably need a better description for this task, and at very least, a link to the exact comment on the other task, otherwise if that other task gets 100 comments, looking for v0lans exact comment can be a pain [06:35:15] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10ArielGlenn) >>! In T206743#4675755, @Banyek wrote: > on db1124 with inst... [06:37:06] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) >>! In T206743#4676466, @ArielGlenn wrote: >>>! In T206743#4... 
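The switch that bit here is the bare `STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE` quoted above, run after out-of-band changes made while GTID was disabled. Below is a minimal sketch of a more defensive version, not existing WMF tooling: derive `gtid_slave_pos` from the coordinates the SQL thread actually executed (via MariaDB's `BINLOG_GTID_POS()`) instead of trusting whatever value is already stored. Hostnames are taken loosely from the incident and credentials are placeholders.

```
#!/usr/bin/env python3
# Sketch: switch a replica to GTID by deriving gtid_slave_pos from the
# executed binlog coordinates, rather than starting blindly on Slave_pos.
# Hosts and credentials are illustrative placeholders.
import pymysql


def connect(host):
    return pymysql.connect(host=host, user='repl_admin', password='********',
                           cursorclass=pymysql.cursors.DictCursor,
                           autocommit=True)


replica = connect('db1071.eqiad.wmnet')  # the replica being switched
with replica.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    assert status is not None, "not configured as a replica"
    # What the SQL thread has executed: the only safe point to resume from.
    exec_file = status['Relay_Master_Log_File']
    exec_pos = status['Exec_Master_Log_Pos']

master = connect(status['Master_Host'])
with master.cursor() as cur:
    # Ask the master which GTID position corresponds to those coordinates.
    cur.execute("SELECT BINLOG_GTID_POS(%s, %s) AS pos", (exec_file, exec_pos))
    gtid_pos = cur.fetchone()['pos']

with replica.cursor() as cur:
    # Pin gtid_slave_pos explicitly, then enable GTID and resume.
    cur.execute("SET GLOBAL gtid_slave_pos = %s", (gtid_pos,))
    cur.execute("CHANGE MASTER TO MASTER_USE_GTID=Slave_pos")
    cur.execute("START SLAVE")
```

If MariaDB refuses the `SET GLOBAL gtid_slave_pos` because it conflicts with the replica's own `gtid_binlog_pos`, that in itself is a signal that local writes have diverged the positions, which is exactly the situation behind the db2045-bin.005880 to 005879 jump discussed above; the safe reaction is to stop and investigate rather than start replication.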
[06:50:52] 180913 10:27:53 [Note] Slave I/O thread exiting, read up to log 'db2045-bin.005880', position 216202297 [06:50:56] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [06:50:59] • [06:51:01] those are different binlogs [06:51:45] yes [06:51:59] I mentioned the incident hapened between these 2 [06:52:27] I am talking care of db1124 [06:53:18] thanks <3 [06:53:59] not defintely [06:54:06] only to catch up replication [07:34:47] 10DBA, 10monitoring: Parser cache hit ration alerting - https://phabricator.wikimedia.org/T207273 (10Banyek) [07:35:12] banyek: <3 [07:35:35] * marostegui hugs banyek [07:35:48] ;) [07:36:37] That is a lot more clear and actually gives context…remember you might read that task in 6 months, so the previous description would make you go thru tickets to see what the banyek from the past actually meant, with this description you get all the context in 1 minute :) [07:39:43] marostegui: could you also compress that dispatch related table on testwikidatawiki please? [07:40:30] is it causing troubles? [07:40:33] is that on s3? [07:42:24] it only have 4 rows, is it causing issues? :| [07:49:28] 10DBA, 10Wikidata, 10Performance, 10User-Daniel: Use memcached (or something similar) to keep the latest chd_seen state, only flush to table every once in a while - https://phabricator.wikimedia.org/T162558 (10Addshore) p:05Normal>03Low [08:01:56] marostegui: no, but in the interest of keeping the schemas the same everywhere etc :) [08:02:16] addshore: ah sure, I can do that, probably not today though [08:02:20] marostegui: thats fine [08:02:24] shall I write a ticket? [08:02:30] sure [08:02:34] will do [08:02:37] thank you [08:07:36] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 (10Addshore) p:05Triage>03Low [08:08:27] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 (10Marostegui) For the record, this table only has 4 rows, so it can probably done directly on the master with replication (once... [08:38:24] I deploy the change on parsercache hosts about the replication monitoring checks. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467959/ [08:38:34] yep [08:38:44] I +1ed after reading the plan too [08:38:53] I've seen that <3 [09:20:10] pc2004 and pc1004 are done, so far no errors, checks are in place, and all green [09:20:17] deploying on the rest of the pc hosts [09:20:21] cool [09:20:23] good job! [09:26:09] wb_terms has stopped correcting differences since row 1600 millions [09:26:16] so maybe it stopped? [09:26:32] I will start a check to not work more than needed [09:26:34] Stopped as in no more differences found? [09:26:55] yeah, the following 100M batches showed no diff [09:27:09] pagelinks gave afull compare.py [09:27:13] with no difference [09:27:17] <3 [09:27:23] but that gave diffs until the last batch [09:27:31] btw, we are doing fine without those two hosts [09:27:34] which is good to know [09:27:35] I will start a compare now to see how many we have left [09:27:52] we could repool db1087. 
but I prefer not to [09:28:01] no need to, we are doing fine [09:28:40] I will then use db1092 to reimport fully pagalinks, page_props and wb_terms into db1087 [09:28:46] as it is faster than doing the check [09:28:55] fully reimport wb_terms? [09:29:12] yes, the changes are spread all over the table [09:29:21] because it is ordered by insertion id [09:29:27] not by page or revision [09:29:46] I am checking, none of those tables hvae triggers [09:29:52] so it will be faster [09:29:52] so that's one less thing to worry about [09:30:06] the others it is faster to insert to at most 8K differences [09:30:13] per table [09:30:23] normally <300 differences [09:30:25] Will you reimport wb_terms already compressed? [09:30:28] Let me check db1124 space pending [09:30:33] well [09:30:37] I will do it in batches [09:30:49] 5TB available on db1124 [09:30:52] so delete a batch and insert it in a loop [09:30:56] So another thing not to worry about [09:31:05] so no space problems [09:31:23] and while most people will be using revision,page,user [09:31:24] and no private data pontential problems either [09:31:34] probaby not many the others [09:31:50] but I don't want to drop and reimport the whole table in one go [09:32:02] batches is easier faster and more secure [09:32:19] I could not do that for the master because the rows must always exist [09:32:22] yeah agreed [09:32:26] so I did it row by row [09:32:36] also even doing it row by row [09:32:40] those 2 tables are not cached [09:32:46] so performance was really bad [09:32:54] 15 minutes to read 100M rows [09:33:07] oh wow [09:33:28] if I do batch inserts, I can do it mostly in memory, then wait let replication catch up, and the import again in memory [09:33:47] slower than normal, but safer and unattended [09:33:58] yeah, and also the table will be available at all times for labs/Tools [09:34:21] the other 20 tables I can do it by hand because I can do each in one go [09:34:33] and I will check with you the list and the triggers, etc [09:34:52] but most stuff is in a good shape [09:34:52] sounds good [09:35:07] however, because it is the master [09:35:13] so triggers we have for: abuse_filter_log archive recentchanges revision user [09:35:14] I will ask you to help me validate it [09:35:20] as I may have made a mistake [09:35:23] sure, anything you need [09:35:35] we can prepare a list of tables and its primary keys [09:35:42] expand the tables_to_compare.txt [09:35:53] and prepare the automation [09:35:55] definitely [09:36:05] I also learned a lot about comparing tables [09:36:20] and I could not program anything, but I have all the requirements for automation of that [09:36:43] which will help with 1) automated validation 2) differential backups [09:36:46] incremental backups coming in! [09:36:49] exactly [09:36:59] also it could integrate with teh binlogs [09:37:10] to replay things from them? [09:37:10] I want to take facebook methodology of doing it [09:37:46] analyze row binlogs and extract data from them- a list of inserts and deletes [09:37:55] without replying the whole thing [09:38:13] it'd be nice to be able to say: replay things from XX timestamp to YY timestamp [09:38:18] Or even from binlog positions [09:38:26] so they can be applied in parallel and generate backups, do checks with replication running, etc. 
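For reference, the batched reimport described above (09:30 to 09:34) amounts to something like the sketch below: copy the table from a known-good host in primary-key ranges, deleting and reinserting one range at a time so the target stays usable and extra rows disappear along with missing or differing ones. Hosts, credentials and the throttling are illustrative; this shows the shape of the loop, not the script actually used.

```
#!/usr/bin/env python3
# Sketch of a chunked reimport of wikidatawiki.wb_terms from a known-good
# replica into the host being rebuilt. Error handling and lag checks omitted.
import time
import pymysql

BATCH = 100000
SOURCE = 'db1092.eqiad.wmnet'   # known-good source
TARGET = 'db1087.eqiad.wmnet'   # depooled host being rebuilt

src = pymysql.connect(host=SOURCE, user='root', password='********',
                      db='wikidatawiki', autocommit=True)
dst = pymysql.connect(host=TARGET, user='root', password='********',
                      db='wikidatawiki', autocommit=True)

with src.cursor() as cur:
    cur.execute("SELECT MAX(term_row_id) FROM wb_terms")
    max_id = cur.fetchone()[0]

for start in range(0, max_id + 1, BATCH):
    end = start + BATCH
    with src.cursor() as cur:
        cur.execute("SELECT * FROM wb_terms "
                    "WHERE term_row_id >= %s AND term_row_id < %s",
                    (start, end))
        rows = cur.fetchall()
    with dst.cursor() as cur:
        # Delete the whole range first, then reinsert it, so rows that should
        # not exist on the target go away along with the missing ones.
        cur.execute("DELETE FROM wb_terms "
                    "WHERE term_row_id >= %s AND term_row_id < %s",
                    (start, end))
        if rows:
            placeholders = ','.join(['%s'] * len(rows[0]))
            cur.executemany(f"INSERT INTO wb_terms VALUES ({placeholders})",
                            rows)
    time.sleep(1)   # crude throttle; the real loop would check replication lag
```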
[09:38:27] or certain tables [09:38:45] for example, you compare your backup yesterday with your server today [09:38:57] and with the binlogs you know they are consistent even different timestamp [09:39:00] many possibiltyies [09:39:36] yeah [09:39:45] wb_terms is a table that we could also store until deletion outside of the main databases [09:39:46] Very granular recovery could be possible [09:39:58] Oh btw [09:39:58] as, as far as I know, it is not used except for autocompletion [09:40:09] which should go to elastic [09:40:18] Do I need to generate an insert on the backups table for the new s8 backup that I manually did? [09:40:19] but meanwhile it should go outside of the metadata dbs [09:40:29] insert? [09:40:34] ah [09:40:37] Or update [09:40:40] so you did a manual backup [09:40:42] Yeah [09:40:45] For s8, as it failed [09:40:51] but did not specify where to announce it [09:40:57] so by default it doesn't announce it [09:41:08] So it is now finished, chown'ed and move it to latest, moved the latest to archive [09:41:22] assuming it finished and it is ok, sure [09:41:23] So I assumed I had to update backups table [09:41:33] but maybe you can run the code [09:41:36] sure [09:41:39] without doing a backup [09:41:46] it will be easier [09:41:47] if you point me to it :) [09:41:51] because it stores each file [09:41:54] individually [09:42:03] the code is on path [09:42:19] on the backup tmp servers [09:42:21] so dbstore1001 [09:42:37] but is it part of dump_sectioy.py or is it a different script? [09:42:38] copy it locally [09:42:40] morning, I forgot to say, deleteLocalPassword actually frees up lots of storage in user table. If you saw a huge decrease in size when you do a schema change there, don't be scared [09:42:42] it is the same [09:42:55] I would say to comment the "generate backup" [09:43:06] Coo, I will dig into it [09:43:07] and call the function with the name you just created [09:43:10] Amir1: thanks [09:43:15] you can insert manually, [09:43:19] but with size null [09:43:24] it will complain [09:43:32] or you can downtime the alert for 7 days [09:43:41] so if you run a backup manually, is there a way to also tell it to announce it to the DB? [09:43:45] like —update-db [09:43:59] Amir1: I will finish the s8 recovery soon [09:44:24] will ping you and anomie when I am happy about the state, we are close to be back in a good one [09:44:35] marostegui: indeed [09:44:38] see --help [09:44:41] :-) [09:44:47] marostegui: what I do [09:44:57] Is to copy the backup.cnf [09:45:09] jynus_: no rush, I just finished enwiki and I'm going alphabetically [09:45:09] and just point to the copy so it is easier than parameters [09:45:12] I think I know where you are heading to [09:45:15] yeah [09:45:22] I am not heading [09:45:24] so let it run with the defaults [09:45:25] but it has parameters [09:45:32] but just for tha local .cnf [09:45:35] --stats pr spe,tjomg [09:45:43] stats or something [09:45:46] --stats-host [09:45:49] check the help [09:46:00] so a normal dump_section —config bla bla will update it by default? 
[09:46:26] if the config has those same parameters [09:46:30] that the default one has [09:46:30] yeah [09:47:13] --statistics-host, --statistics-user, etc [09:47:17] also on the yaml [09:47:19] I think I will do that, that is way easier [09:47:25] I do that [09:47:50] but remember to sudo -u [09:47:53] yep [09:48:10] there is a class for statistics [09:48:16] you can run just that [09:48:43] DatabaseBackupStatistics [09:48:51] yeah, I am reading it [09:49:19] def __init__(self, dump_name, section, source, backup_dir, config) [09:49:27] that initialized the config from the command line [09:50:24] and the you can do stats.start() stats.gather_metrics() stats.finish() [09:50:44] so you gather metrics without dumping [09:51:01] I guess I could add that functionality, but I never thought about that being useful [09:51:13] the poing is to control the real time status [09:51:16] Yeah, not a big deal [09:52:39] I can do it, but honestly, I prefer if you do [09:52:46] to familiarice with it [09:52:47] Yeah [09:53:00] also you can say "this is not intuitive" [09:53:05] and send a patch, etc. [09:53:08] :) [09:53:29] I think this was actually the first time I did a manual backup after all the recent changes [09:53:33] so next time, I just copy the yamls and do it like that [09:53:39] The size of change_tag table in innodb buffer pool for wikidata got a five fold increase since yesterday [09:53:44] I hope that's good [09:53:45] yaml > command line parameters [09:53:50] yeah [09:54:16] Amir1: don't do heavy I_S or P_S querying on production [09:54:20] specially on large tables [09:54:28] * banyek read back, and now scratches his head as trying to get into the context [09:54:34] you are querying 1TB of data in memory [09:54:44] and that id bad for performance [09:54:57] you are prefiling, which is not free [09:55:22] also Amir1 recovery requires a lot of memory changes [09:55:33] so please don't use the current state as a normal one [09:55:47] I will help you evaluate the impact of your deployment, but not this week [09:55:50] and not on s8 [09:56:49] I know you're busy so I don't bother for this week. Let me know when we can check if things are working alright [09:56:56] next week [09:56:59] :-) [09:57:07] (The schema change to drop four index and one column is still needed) [09:58:33] marostegui: 2018-10-18T09:58:01.804147: row id 190263474/2380534342, ETA: 66m07s, 0 chunk(s) found different [09:58:45] pagelinks? [09:58:51] wb_terms [09:58:57] pagelinks finished correctly already [09:59:16] we will see if s8 master + all pooled replicas is "finished" too [09:59:17] Amir1: that schema change is still pending on codfw, once we are out of the woods we can resume it. We have lots of things queuing [09:59:23] jynus nice! [09:59:38] once that is done, I will fix labs [09:59:48] labs as in sanitarium? [09:59:51] and we can think about stopping replication [09:59:52] yes [10:00:05] I consider sanitarium part of the labs infrastructure [10:00:10] even if it is on production [10:00:15] as it is only needed for that [10:00:42] and with labs I mean wikireplicas clouddb hosts [10:00:47] yeah, one could understood labs as labsdb10XX [10:00:49] or should they be cloudb? [10:00:51] like, the hostname [10:01:17] I guess clouddb? [10:01:36] or wikireplicasdb? [10:01:37] and the role cloud::db [10:01:52] I think they (cloud team) call them wikireplicas? 
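A rough sketch of the "statistics only" run discussed above (09:48 to 09:51), i.e. registering an already-generated manual dump without producing a new one. It assumes the DatabaseBackupStatistics constructor and the start()/gather_metrics()/finish() calls quoted in the conversation; the import path, config keys and dump name below are illustrative guesses, so check dump_section.py on dbstore1001 for the real shapes, and remember to run it with sudo -u as noted above.

```
# Hypothetical driver around the class quoted above; not the real tooling.
from dump_section import DatabaseBackupStatistics  # assumed import path

config = {
    # Assumed keys, mirroring the --statistics-host/--statistics-user options
    # mentioned above; the real script reads them from its own config file.
    'statistics-host': 'STATISTICS_DB_HOST',
    'statistics-user': 'dump',
    'statistics-password': '********',
}

stats = DatabaseBackupStatistics(
    dump_name='dump.s8.2018-10-18--05-25-22',  # illustrative name of the manual run
    section='s8',
    source='SOURCE_HOST:PORT',                 # host the manual dump was taken from
    backup_dir='/srv/backups/latest',          # where the dump was moved to
    config=config,
)
stats.start()            # record the run as ongoing
stats.gather_metrics()   # walk backup_dir and store per-file sizes
stats.finish()           # mark it finished so the "size null" alert goes away
```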
[10:01:56] clouddbwikireplicaanalyticsdb.eqiad.wmnet [10:02:26] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ [10:02:33] banyek: this is good for you to read too ^ [10:02:47] sure [10:02:48] so you get an introduction of how they are and work [10:04:06] gah! I launched a full backup on s8 [10:04:18] Anyways, I will leave it running anyways [10:04:28] It won't take long [10:10:26] Thas was a really good post, clean, and meaningful good read; kudos! [10:10:47] banyek: for more context (future) [10:10:55] once analytics renew dbstore1002 [10:11:10] labsdb* 3 hosts will bhe the last 3 to keep multisource [10:11:26] but that promise will not be kept once those are renewed [10:11:43] metadata for all sections just doesn't fit in a single server [10:11:50] in 2-3 year's time [10:12:16] and some wikis like enwiki are way more queried than others, so it would be nice to have more resources for those [10:12:26] How long did it take us to finish that lab project? was it around 1.5y? [10:13:12] when you came in I was arealy working on that [10:13:21] I had setup the missing labsdb1008 [10:13:33] i remember labsdb1009 were already bought [10:13:37] but they were empty [10:13:39] no? [10:13:42] yes [10:13:48] I remember on the offsite we had a chat with labs about the future [10:13:49] but lots of work before that [10:13:52] and that was oct I think [10:13:56] labsdb broke every week [10:13:59] and I reimported enwiki [10:14:13] and it broke one month later due to STATEMENT based replication [10:14:20] after reimporting for 3 months [10:14:24] hahaha [10:14:27] pff [10:14:44] so decided to switch to ROW [10:14:48] due to filters [10:15:05] and then you came in, and we decided the master being in row [10:15:07] I wonder where labsdb1001 and 1003 are now, probably sitting on a cold ground :( [10:15:21] Or beloved labs hosts [10:15:24] they should have been returned, but I don't know if they did [10:15:34] labsdb1002 failed onece on the middle of the night [10:15:37] yeah, I know they were unracked [10:15:41] because raid0 was being used [10:15:43] I never got to meet labsdb1002 [10:15:56] they also crashed due to OOM every day [10:15:57] was it as friendly as 1001 and 1003? [10:16:14] over 50% of my work here was managing labsdb hots [10:16:20] for the first year [10:16:43] because they were so bad [10:16:49] that and do schema changes [10:17:06] omg [10:17:10] which hadn't been done in the last 6 months 1 year [10:17:33] labsdb used tokudb [10:17:39] which is ok for analytics (key value) [10:17:52] but for meadata, they crashed or gey locked eny time [10:17:59] it was horrible [10:18:00] because deifferent query plan than innodb [10:18:06] lag at all times [10:18:12] and that bug on the old db1069 with replication stuck on toku... 
[10:18:14] there was not even a way to measure lag [10:18:24] people used to check the latest revision insert [10:18:29] and check based on that [10:18:42] yes, remember for me labs includes sanitarium [11:04:22] 2018-10-18T11:03:00.550729: row id 640263474/2380534342, ETA: 192m18s, 0 chunk(s) found different [11:08:16] yay [11:10:59] I am going to have some lunchtime [11:49:15] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375 (10Urbanecm) [11:50:53] 10DBA, 10Cloud-Services: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193 (10Urbanecm) [11:51:02] 10DBA, 10Patch-For-Review: Prepare and check storage layer for id_internalwikimedia - https://phabricator.wikimedia.org/T196748 (10Urbanecm) [11:51:12] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112 (10Urbanecm) [12:20:20] ^ I'll handle those [12:21:48] thanks banyek :) [12:23:27] Ah [12:23:31] Those are already done [12:23:37] I knew gorwiki sounded familiar [12:25:33] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) Is this alert fully deployed? [12:27:33] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) I am not sure how useful is this, honestly- this alert would have not prevented the issue at all: ``` MariaDB Slave IO: pc1 OK... [12:29:05] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) >>! In T206992#4677326, @jcrespo wrote: > I am not sure how useful is this, honestly- this alert would have not prevented the... [12:32:21] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) > I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I wo... [12:33:52] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) >>! In T206992#4677332, @jcrespo wrote: >> I did check the parsercache hosts before the failover, to make sure they were all g... [12:34:57] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) > I added the step of checking replication a few days in advance to our DC failover checklist so we can also remember that for ne... [12:36:06] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) @Volans ^ is that something we can do on the dc switchover script? 
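The pre-switchover check being asked about in T206992/T207385 could be as small as the sketch below: before a DC failover, refuse to proceed unless every parsercache host has both replication threads running and little lag. The host list, credentials and threshold are placeholders, and in practice this would live inside the switchdc tooling rather than a standalone script.

```
#!/usr/bin/env python3
# Sketch of a "are all parsercache hosts replicating?" gate for a switchover.
import sys
import pymysql

PC_HOSTS = ['pc1004.eqiad.wmnet', 'pc2004.codfw.wmnet']   # ...and the rest
MAX_LAG = 30  # seconds, illustrative threshold

problems = []
for host in PC_HOSTS:
    conn = pymysql.connect(host=host, user='check_user', password='********',
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if status is None:
        problems.append(f"{host}: replication not configured")
        continue
    if status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
        problems.append(f"{host}: replication threads not running")
    elif (status['Seconds_Behind_Master'] is None
          or status['Seconds_Behind_Master'] > MAX_LAG):
        problems.append(f"{host}: lagging ({status['Seconds_Behind_Master']}s)")

if problems:
    sys.exit("not safe to switch over:\n" + "\n".join(problems))
print("all parsercache hosts replicating and in sync")
```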
[12:36:27] so I am back [12:38:09] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek) 05Open>03Resolved @jcrespo Yes, it is deployed, I was just waiting on close [12:38:38] ains [12:38:43] sigh [12:39:27] 2018-10-18T12:38:58.974183: row id 1240263474/2380534342, ETA: 153m17s, 0 chunk(s) found different [12:45:54] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) I have created T207385 so we can follow the discussion there. [12:50:16] 10DBA, 10monitoring: Parser cache hit ration alerting - https://phabricator.wikimedia.org/T207273 (10jcrespo) Parser cache hit ratio alerting is difficult, specially on a passive DC. A better option would be a script that checks that most of the content is not expired, aka "does not contain mostly garbage". Th... [12:52:48] I have expanded on my thoughts on the 2 tickets [12:53:01] I am not saying it has to be that, just giving an expanded suggestion [12:53:26] and then you can build on that, proposing alternatives, etc [12:53:36] but I wanted a full initial proposal [12:53:37] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Banyek) [12:54:18] marostegui: one question before you are busy [12:54:47] (maybe you are already) [12:55:20] yes? [12:55:30] replication changes gtid, etc [12:55:40] this week, do we wait next week? [12:55:44] I don't want to do those on my own [12:55:50] let's do them on monday [12:55:53] +1 [12:56:04] wanted to know your toughts, I agree [12:56:11] I will focus on having labs fixed then [12:57:00] <3 [13:04:04] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Volans) My suggestion for this kind of check was not for the passive dc, but mainly the active one to make sure that the parser caches are properly used. We might have changes in mediawiki that will change the h... [13:07:26] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10jcrespo) I don't think having such alarm is bad- it is easy to setup, just setting up a prometheus one- but it may arrive too late. I think a check on switchdc would prevent issues rather than identify them afte... [13:10:03] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Volans) That's exactly what I meant, we should have this check independently and adding other checks to the other part described in T207385 to prevent it. [13:10:40] volans: let's agree we agree [13:10:43] :-D [13:10:48] totally :) [13:11:26] I think the check you propose if good, but good in general [13:11:41] yep was not meant to prevent what happened [13:11:41] while the code check is more the actinable [13:11:54] there are other reasons to keep the other check [13:12:06] e.g. 
if mediawiki code breaks pc [13:12:54] and instantly all or many caches are invalid [13:13:04] as, not exactly, but somewhat happened begore [13:13:20] (less dramatically) [13:13:30] indeed [13:13:42] having bigger and more pc servers will help too [13:25:30] sometimes have nightmares during the night thinking wikidata.term_row_id is defined as an int and we are inserting maxint id columns [13:25:50] I meant wikidatawiki.wb_terms.term_row_id [13:31:42] you have really weird dreams then I have to say 🤣 [13:31:54] In my nightmares I am smoking again [13:38:10] jynus: marostegui: I know this is late to tell, but are you aware that I'll be away Mon-Tue? (It's a national holday and a 4 day long weekend here) [13:39:07] I wasn't [13:39:12] Put it in your calendar [13:56:00] We'll start working soon on db2042 BBU with p4paul, but we were talking about this earlier here, here's my quick recap: [13:56:11] - db2042 is a backup host, if no backup is running on that it's safe to power-off [13:56:11] - before powering off the binlog file/pos should be written down as it replicating from db1072 [13:56:11] - if something bad happens (corrupted fs, server doesn't boot up, etc.) db2078 should be repositioned to db1072 [13:56:18] did I missed something? [13:56:26] marostegui: ^ [13:56:39] db2042 is a backup host? [13:57:02] I guess it is a misc? [13:58:12] maybe we should change backups to db2078:3323, but not relevant right now [13:58:58] banyek: what *I* would do, stress on I [13:59:14] is stop replication, make sure the replicas is on the same binlog [13:59:27] and write down the master position [13:59:32] ok [13:59:33] but I guess that is what you meant? [13:59:51] "having enough information to repoint db2078" [13:59:52] yes, except changing the backups [14:00:00] yeah, that is not now [14:00:10] backups should not be runinng now [14:00:27] don't worry [14:02:14] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10daniel) [14:03:53] I checked, last backup was ran on 10.16 and was successful [14:04:56] I'll prepare the host for papaul (stopping replication, writing down master position, stopping mysql and power down the host, so he can work whenever he can) [14:05:15] thank you, banyek [14:17:55] ```CHANGE MASTER TO MASTER_HOST='db1072.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='', MASTER_LOG_FILE='db1072-bin.000510',MASTER_LOG_POS=700686527, MASTER_SSL=1;``` (if we need to reposition db2078:3323) [14:18:40] thanks, having it here is a great way to make sure it is not lost [14:18:54] and also I have it in case for some reasons you disconnected [14:21:07] Downtimed host, now shutting down MySQL [14:21:34] did you downtime also io and lag on the replica? [14:21:44] otherwise it will complain too [14:22:54] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) [14:23:25] banyek^ [14:23:28] oh, no, good that you mentioned, downtiming them [14:23:43] disaable alerts temporarilly [14:23:53] if it is soft alerting it will log anyway [14:40:11] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Papaul) a:05Papaul>03Banyek Disk replacement complete [14:42:36] Amir1: running password stuff on s3, right? 
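The pre-maintenance steps agreed above for db2042 (stop replication, make sure db2078:3323 is on the same binlog, write down the db1072 position) can be captured in one place, roughly as below. Hosts come from the conversation, credentials are placeholders, and the generated CHANGE MASTER statement is only a template of the repoint command already pasted above.

```
#!/usr/bin/env python3
# Sketch: snapshot db2042's replication state before powering it off, and
# print the repoint command for db2078:3323 in case db2042 never comes back.
import time
import pymysql


def dict_conn(host, port=3306):
    return pymysql.connect(host=host, port=port, user='root',
                           password='********', autocommit=True,
                           cursorclass=pymysql.cursors.DictCursor)


intermediate = dict_conn('db2042.codfw.wmnet')
replica = dict_conn('db2078.codfw.wmnet', port=3323)

with intermediate.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute("SHOW SLAVE STATUS")
    upstream = cur.fetchone()      # coordinates on db1072, needed for a repoint
    cur.execute("SHOW MASTER STATUS")
    local_binlog = cur.fetchone()  # db2042's own binlog position

# Wait until db2078:3323 has executed everything db2042 wrote.
while True:
    with replica.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if (status['Relay_Master_Log_File'] == local_binlog['File']
            and status['Exec_Master_Log_Pos'] >= local_binlog['Position']):
        break
    time.sleep(1)

print("db2042 was executing from db1072 at:",
      upstream['Relay_Master_Log_File'], upstream['Exec_Master_Log_Pos'])
print("If db2042 does not come back, on db2078:3323 run something like:")
print("  CHANGE MASTER TO MASTER_HOST='db1072.eqiad.wmnet', MASTER_USER='repl',")
print(f"  MASTER_LOG_FILE='{upstream['Relay_Master_Log_File']}',"
      f" MASTER_LOG_POS={upstream['Exec_Master_Log_Pos']}, MASTER_SSL=1;")
```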
[14:42:50] I saw db1095 (not part of mw) lagging, and I think is that [14:43:02] I just want to make sure I don't have to worry [14:43:21] if it's not part of mw, it should not be affected [14:43:34] do you want me to stop it to make sure? [14:43:34] It replicates it [14:43:38] no, it is ok [14:43:47] if you tell me "yes, it is running" is enough [14:44:02] it's running [14:44:04] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) 05Open>03Resolved Perfect, thank you! The logical drive is getting rebuilded: ``` Smart Array P420i in Slot 0 (Embedded) array A Logical Drive: 1 Size: 3.3 TB Fa... [14:44:05] cool [14:44:08] that was all :-D [14:44:14] :P [14:44:25] apprently incubatorwiki has a lot of users [14:44:44] but on non mw-servig traffic those tables will be cold [14:45:16] I will ack it on icinga [14:50:56] hello. do you still need/use/remember this Icinga check command: check_lonqqueries.pl ? I am wondering whether i should make it work on stretch for icinga1001 or remove it [14:51:09] right now it seems like we have the script and checkcommand but are not using it [14:51:53] this is how it looks https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468066/4/modules/icinga/files/check_longqueries.pl [14:54:44] we should remove it [14:55:28] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Marostegui) 05Resolved>03Open Leave it open until it finally gets rebuilt. They fail quite often unfortunately specially on old hosts and they need Papaul or Chris to pull the disk out and then back in [14:56:16] mutante: shouldn't we remove the one above and the one below too? [14:57:24] you can merge that, but extra clean up will be needed later [15:00:39] jynus_: thanks for the review:) appreciate it. i did not look at the other mysql related checks because my focus was on "all scripts using the Perl module Nagios::Plugin". The reason is that it was renamed to Monitoring::Plugin in stretch (Nagios threatened CPAN because trademarks or so). .so they would break on icinga1001 which i want to get into prod [15:02:35] after draining power on db2042 the BBU seems working again, but I asked papaul for checking the spare BBU we have (which were not working on dbstore2002) to test, just to know if we have a spare to count on or not. [15:04:43] mutante: didn't icinga used the mysql module [15:05:14] aka T162070 [15:05:15] T162070: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 [15:05:44] (going alphabetically, kmwiki right now) [15:05:51] it is ok [15:05:58] I just needed s3 confirmation [15:06:06] that will take a while [15:06:20] if we go to a separate section is when I will be interested [15:10:41] jynus_: we made it so that it will stop using the mysql module on stretch [15:10:51] cool, thanks [15:10:54] instead if will do require_package('mariadb-client') [15:11:07] that's all that include did anyways [15:12:56] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) icinga will stop using the mysql module once on stretch. Once T202782 is resolved and einsteinium isn't the prod Icinga server... [15:20:48] Amir1: your script hitting s3 again I guess? 
[15:21:24] Probably the same issue with s1 on Monday, going fast on s1 and s3 codfw master struggling because of the lack of SSDs [15:21:37] yup [15:22:29] codfw struggles (normally) because extra latency and no wait there [15:22:43] which makes me doubt how that will work with cross-dc [15:22:51] and the lack of DCs [15:22:53] SSDs [15:23:29] we should leave the old masters pooled with weight 1 once they are replaced, so we can throttle the scripts :p [15:24:50] this is a one time thing (at least the script) [15:25:06] but I can see the concern for later as well [15:25:25] Yeah, so far we are not active-active [15:25:32] but when we are….we need to see how to handle those [15:29:10] Amir1: will you script run over night? [15:29:27] Just asking to see if I have to give our US folks a heads up about possible delays on codfw [15:39:57] marostegui: very likely [15:40:23] That BBU is not working, so we don't have a spare one [15:40:41] :_( [15:40:44] At least it is confirmed [15:40:48] Amir1: cheers [15:41:14] banyek: then I would say let's then get the server back and then we can plan for a DC failover sometime [15:41:49] actually after power drain the original BBU seems working as in dbstore2002 happened too [15:41:56] ha [15:42:10] the majesty of power it off and on! [15:43:23] I will leave db1095 downtime till tomorrow [15:43:46] heh [15:43:49] ok, cool [15:57:29] db2042 is back in action, it replicates well, and db2078:3323 too. The BBU seems working, I mark the task resolved [15:57:56] nothing helps as much as a reboot [15:59:48] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) 05stalled>03Resolved a:03Banyek @Papaul did power drain that fixed the battery status. We tried our spare battery in this host as well (T205257) but it doesn... [16:01:20] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) if it fails again, I suggest we go for a DC failover. [16:01:42] anyways, I go now for cultural coordination [16:03:54] I have officially finished the s8 fixes now [16:04:02] <3 [16:04:09] I will need some extra checks, however to be 100% sure [16:04:20] Why the DC failover then? It is not the 1072 one which is used now? [16:04:21] let me know how I can help [16:04:33] Jynus_: omg, KUDOS [16:04:42] banyek: that is why I said: DC failover and not primary master failover :) [16:05:07] marostegui: I am not going to touch s8 master anymore [16:05:26] it would be nice to have some compare.py or something for a double check [16:05:31] yeah [16:05:36] is also labs done too? [16:05:38] I will focus tomorrow on labs [16:05:41] ^ [16:05:44] ah cool [16:05:55] probably fixed by weekend [16:06:18] we will see [16:06:41] * marostegui updates timelines [16:06:47] I was mostly worried about the master because it could cause a replication break [16:07:01] yeah it would be a good cascade [16:09:21] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from Jaime 18th Oct 16:05: s8 core hosts all finished... 
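For the "some compare.py or something for a double check" mentioned above, that chunked comparison boils down to the sketch below: fingerprint the table in primary-key ranges on two hosts and only flag ranges whose fingerprints differ. It is written against wikidatawiki.wb_terms with a subset of its columns; hostnames are examples and the CRC32 sum is a cheap, collision-prone stand-in for a proper checksum, so treat it as an outline rather than the real compare.py.

```
#!/usr/bin/env python3
# Sketch: chunked checksum comparison of one table between two hosts.
import pymysql

CHUNK = 1000000
TABLE = 'wb_terms'
PK = 'term_row_id'
# Cheap per-chunk fingerprint: row count plus a sum of per-row CRC32s
# over a subset of columns.
CHECKSUM_SQL = (
    f"SELECT COUNT(*), COALESCE(SUM(CRC32(CONCAT_WS('#', "
    f"term_row_id, term_entity_id, term_language, term_type, term_text))), 0) "
    f"FROM {TABLE} WHERE {PK} >= %s AND {PK} < %s"
)


def connect(host):
    return pymysql.connect(host=host, user='root', password='********',
                           db='wikidatawiki', autocommit=True)


a = connect('db1071.eqiad.wmnet')   # e.g. the fixed master
b = connect('db1092.eqiad.wmnet')   # e.g. a known-good replica

with a.cursor() as cur:
    cur.execute(f"SELECT MAX({PK}) FROM {TABLE}")
    max_id = cur.fetchone()[0]

different = []
for start in range(0, max_id + 1, CHUNK):
    fingerprints = []
    for conn in (a, b):
        with conn.cursor() as cur:
            cur.execute(CHECKSUM_SQL, (start, start + CHUNK))
            fingerprints.append(cur.fetchone())
    if fingerprints[0] != fingerprints[1]:
        different.append((start, start + CHUNK))
    print(f"row id {min(start + CHUNK, max_id)}/{max_id}, "
          f"{len(different)} chunk(s) found different")
```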
[16:09:38] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) [16:09:49] db2042 caught up. [20:25:56] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Volans) db2042 failed to start `ferm` at reboot due to a DNS timeout query: ``` Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query t... [20:30:42] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Volans) Opened T207417 for the ferm part. [21:07:00] 10DBA, 10JADE, 10Operations, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 4 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui We've merged the DDL to our repo in order to unblock development, so here ar...