[04:57:50] banyek|away: db1124 is sanitarium, so not nice to have it broken as it affects labs, but not super urgent. It was expected it to break during the fixes [05:17:46] Yes I just thought it at the end but the alert was almost new so I was like better notice needless, than not [05:17:53] Just woke up [05:25:22] the backups failed for s8, it was a privileges issue [05:25:24] I am fixing it [05:38:42] marostegui did you create the right user? [05:39:20] jynus: yep [05:39:30] it is now flowing [05:40:07] into /srv/tmp.s8, once that is done, I will move the latest one to archive, and this one to latest [05:40:10] does that sound ok? [05:40:48] I am not 100% sure if that is the correct procedure once a manual backup is taken [05:40:57] you are generating it as root [05:41:02] you will need to chown it [05:41:07] Yeah, I was planning to do that [05:41:15] (as I realised I did it as root instead of dump) [05:43:51] the alerts work :-) [05:43:54] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from 17th at 19:04 from Jaime: all tables except wb_t... [05:44:10] jynus: indeed :) [05:45:20] where did you found the logs? [05:45:49] On /srv/backups/ongoing [05:45:52] log.s8 [05:45:54] no [05:46:00] the db1071 logs [05:46:56] Ah [05:47:07] On /srv/sqldata [05:47:20] because clearly those tell the story [05:47:32] I have not analyzed them fully yet [05:47:37] What's the story you see? [05:47:42] STOP; enable getid; START [05:47:55] yes, but the position is the same [05:47:58] (and start replicating on the current position) [05:48:03] no, it is not, afaiks [05:48:25] it is [05:48:36] master_log_pos='1036765620'. that is what GTID starts to [05:48:39] and this is the previous [05:48:46] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:49:58] 'CHANGE MASTER TO executed'. Previous state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005880', master_log_pos='216202297' [05:50:15] New state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005879',master_log_pos='1036765620' [05:51:05] Previous Using_Gtid=No. New Using_Gtid=Slave_Pos [05:52:08] at what time? [05:52:16] 10:27:54 [05:52:33] you wrote that! [05:52:35] but look a line after that [05:52:40] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [05:53:04] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:53:09] I don't think that is a coincidence [05:53:16] yeah, but it is weird [05:53:16] I bet that is the missing gap [05:53:21] because if you see the secuence: [05:53:32] 180913 10:27:53 [Note] Slave SQL thread exiting, replication stopped in log 'db2045-bin.005879' at position 1036765620 [05:53:35] 180913 10:27:53 [Note] Slave I/O thread exiting, read up to log 'db2045-bin.005880', position 216202297 [05:53:38] those are the SQL and IO threads [05:53:40] I am quite sure 78 -> 79 is the missing gap [05:53:40] and then [05:53:42] 180913 10:27:54 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005880', master_log_pos='216202297'. 
New state master_host='db2045.codfw.wmnet', master_port='3306', master_log_file='db2045-bin.005879', master_log_pos='1036765620'. [05:53:48] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [05:54:01] so the SQL thread started on the right position [05:54:19] 79 is already the jump [05:54:41] so the IO thread is the one that jumped but SQL thread connected finely [05:54:48] sql may be right, but io may juped [05:54:52] it doesn't matter [05:54:58] position was wrong [05:55:35] and I guessing you didn't indicate a position manually, so gtid autopositioned on the wrong pos [05:55:52] yeah, I always use: STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE ; [05:55:53] based on the local state/time [05:56:01] which is scary [05:56:13] but 10:27 is not when the thing happened [05:56:15] becuase it means that if we stop on the heartbeat [05:56:50] I think it is [05:57:02] as you said, 11:13: We believe the schema change finished [05:57:04] wasn't it 09:08 till 09:58 the gap? [05:57:18] alter finished on 58 [05:57:24] and then it started replicationg [05:57:29] or it started manually [05:57:47] or it got repositioned [05:57:50] etc. [05:58:08] but according to that at 09:08 GTID wasn't enabled [05:58:11] I belive gtid + ongoing local changes is the source of this [05:58:14] it doesn't matter [05:58:16] yeah, me too [05:58:18] it was stopped [05:58:29] ah I get you [05:58:32] and it didn't start at the right position [05:58:54] so it doesn't matter nothing "broke" at 9:08 [05:59:02] but because repl was stopped [05:59:22] it is funny because you wrote the report [05:59:38] and I was like, manuel actually found the cause and didn't say anyting! [05:59:59] haha [06:00:05] when it is the only log you pasted? [06:00:06] I checked only SQL thread [06:00:36] yeah, but if sql was stpped [06:00:41] io will tell the story [06:02:50] I wonder if it is easy to reproduce [06:02:54] because we do this all the time [06:02:59] ALthough I am not sure if we do with lag [06:07:09] we don't do often local changes [06:07:21] with binlog enabled [06:11:24] and with gtid stopped [06:11:26] that is true [06:11:46] so we normally don't do: out of band changes with gtid stopped, and then enable it [06:12:02] we do lots of out of band but with gtid enabled [06:33:57] b*anyek: https://phabricator.wikimedia.org/T207273 we probably need a better description for this task, and at very least, a link to the exact comment on the other task, otherwise if that other task gets 100 comments, looking for v0lans exact comment can be a pain [06:35:15] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10ArielGlenn) >>! In T206743#4675755, @Banyek wrote: > on db1124 with inst... [06:37:06] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) >>! In T206743#4676466, @ArielGlenn wrote: >>>! In T206743#4... 
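The switch that bit here is the bare `STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE` quoted above, run after out-of-band changes made while GTID was disabled. Below is a minimal sketch of a more defensive version, not existing WMF tooling: derive `gtid_slave_pos` from the coordinates the SQL thread actually executed (via MariaDB's `BINLOG_GTID_POS()`) instead of trusting whatever value is already stored. Hostnames are taken loosely from the incident and credentials are placeholders.

```
#!/usr/bin/env python3
# Sketch: switch a replica to GTID by deriving gtid_slave_pos from the
# executed binlog coordinates, rather than starting blindly on Slave_pos.
# Hosts and credentials are illustrative placeholders.
import pymysql


def connect(host):
    return pymysql.connect(host=host, user='repl_admin', password='********',
                           cursorclass=pymysql.cursors.DictCursor,
                           autocommit=True)


replica = connect('db1071.eqiad.wmnet')  # the replica being switched
with replica.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    assert status is not None, "not configured as a replica"
    # What the SQL thread has executed: the only safe point to resume from.
    exec_file = status['Relay_Master_Log_File']
    exec_pos = status['Exec_Master_Log_Pos']

master = connect(status['Master_Host'])
with master.cursor() as cur:
    # Ask the master which GTID position corresponds to those coordinates.
    cur.execute("SELECT BINLOG_GTID_POS(%s, %s) AS pos", (exec_file, exec_pos))
    gtid_pos = cur.fetchone()['pos']

with replica.cursor() as cur:
    # Pin gtid_slave_pos explicitly, then enable GTID and resume.
    cur.execute("SET GLOBAL gtid_slave_pos = %s", (gtid_pos,))
    cur.execute("CHANGE MASTER TO MASTER_USE_GTID=Slave_pos")
    cur.execute("START SLAVE")
```

If MariaDB refuses the `SET GLOBAL gtid_slave_pos` because it conflicts with the replica's own `gtid_binlog_pos`, that in itself is a signal that local writes have diverged the positions, which is exactly the situation behind the db2045-bin.005880 to 005879 jump discussed above; the safe reaction is to stop and investigate rather than start replication.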
[06:50:52] 180913 10:27:53 [Note] Slave I/O thread exiting, read up to log 'db2045-bin.005880', position 216202297 [06:50:56] 180913 10:27:54 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db2045.codfw.wmnet:3306' in log 'db2045-bin.005879' at position 1036765620 [06:50:59] • [06:51:01] those are different binlogs [06:51:45] yes [06:51:59] I mentioned the incident hapened between these 2 [06:52:27] I am talking care of db1124 [06:53:18] thanks <3 [06:53:59] not defintely [06:54:06] only to catch up replication [07:34:47] 10DBA, 10monitoring: Parser cache hit ration alerting - https://phabricator.wikimedia.org/T207273 (10Banyek) [07:35:12] banyek: <3 [07:35:35] * marostegui hugs banyek [07:35:48] ;) [07:36:37] That is a lot more clear and actually gives context…remember you might read that task in 6 months, so the previous description would make you go thru tickets to see what the banyek from the past actually meant, with this description you get all the context in 1 minute :) [07:39:43] marostegui: could you also compress that dispatch related table on testwikidatawiki please? [07:40:30] is it causing troubles? [07:40:33] is that on s3? [07:42:24] it only have 4 rows, is it causing issues? :| [07:49:28] 10DBA, 10Wikidata, 10Performance, 10User-Daniel: Use memcached (or something similar) to keep the latest chd_seen state, only flush to table every once in a while - https://phabricator.wikimedia.org/T162558 (10Addshore) p:05Normal>03Low [08:01:56] marostegui: no, but in the interest of keeping the schemas the same everywhere etc :) [08:02:16] addshore: ah sure, I can do that, probably not today though [08:02:20] marostegui: thats fine [08:02:24] shall I write a ticket? [08:02:30] sure [08:02:34] will do [08:02:37] thank you [08:07:36] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 (10Addshore) p:05Triage>03Low [08:08:27] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10wikidata-tech-focus: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 (10Marostegui) For the record, this table only has 4 rows, so it can probably done directly on the master with replication (once... [08:38:24] I deploy the change on parsercache hosts about the replication monitoring checks. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467959/ [08:38:34] yep [08:38:44] I +1ed after reading the plan too [08:38:53] I've seen that <3 [09:20:10] pc2004 and pc1004 are done, so far no errors, checks are in place, and all green [09:20:17] deploying on the rest of the pc hosts [09:20:21] cool [09:20:23] good job! [09:26:09] wb_terms has stopped correcting differences since row 1600 millions [09:26:16] so maybe it stopped? [09:26:32] I will start a check to not work more than needed [09:26:34] Stopped as in no more differences found? [09:26:55] yeah, the following 100M batches showed no diff [09:27:09] pagelinks gave afull compare.py [09:27:13] with no difference [09:27:17] <3 [09:27:23] but that gave diffs until the last batch [09:27:31] btw, we are doing fine without those two hosts [09:27:34] which is good to know [09:27:35] I will start a compare now to see how many we have left [09:27:52] we could repool db1087. 
but I prefer not to [09:28:01] no need to, we are doing fine [09:28:40] I will then use db1092 to reimport fully pagalinks, page_props and wb_terms into db1087 [09:28:46] as it is faster than doing the check [09:28:55] fully reimport wb_terms? [09:29:12] yes, the changes are spread all over the table [09:29:21] because it is ordered by insertion id [09:29:27] not by page or revision [09:29:46] I am checking, none of those tables hvae triggers [09:29:52] so it will be faster [09:29:52] so that's one less thing to worry about [09:30:06] the others it is faster to insert to at most 8K differences [09:30:13] per table [09:30:23] normally <300 differences [09:30:25] Will you reimport wb_terms already compressed? [09:30:28] Let me check db1124 space pending [09:30:33] well [09:30:37] I will do it in batches [09:30:49] 5TB available on db1124 [09:30:52] so delete a batch and insert it in a loop [09:30:56] So another thing not to worry about [09:31:05] so no space problems [09:31:23] and while most people will be using revision,page,user [09:31:24] and no private data pontential problems either [09:31:34] probaby not many the others [09:31:50] but I don't want to drop and reimport the whole table in one go [09:32:02] batches is easier faster and more secure [09:32:19] I could not do that for the master because the rows must always exist [09:32:22] yeah agreed [09:32:26] so I did it row by row [09:32:36] also even doing it row by row [09:32:40] those 2 tables are not cached [09:32:46] so performance was really bad [09:32:54] 15 minutes to read 100M rows [09:33:07] oh wow [09:33:28] if I do batch inserts, I can do it mostly in memory, then wait let replication catch up, and the import again in memory [09:33:47] slower than normal, but safer and unattended [09:33:58] yeah, and also the table will be available at all times for labs/Tools [09:34:21] the other 20 tables I can do it by hand because I can do each in one go [09:34:33] and I will check with you the list and the triggers, etc [09:34:52] but most stuff is in a good shape [09:34:52] sounds good [09:35:07] however, because it is the master [09:35:13] so triggers we have for: abuse_filter_log archive recentchanges revision user [09:35:14] I will ask you to help me validate it [09:35:20] as I may have made a mistake [09:35:23] sure, anything you need [09:35:35] we can prepare a list of tables and its primary keys [09:35:42] expand the tables_to_compare.txt [09:35:53] and prepare the automation [09:35:55] definitely [09:36:05] I also learned a lot about comparing tables [09:36:20] and I could not program anything, but I have all the requirements for automation of that [09:36:43] which will help with 1) automated validation 2) differential backups [09:36:46] incremental backups coming in! [09:36:49] exactly [09:36:59] also it could integrate with teh binlogs [09:37:10] to replay things from them? [09:37:10] I want to take facebook methodology of doing it [09:37:46] analyze row binlogs and extract data from them- a list of inserts and deletes [09:37:55] without replying the whole thing [09:38:13] it'd be nice to be able to say: replay things from XX timestamp to YY timestamp [09:38:18] Or even from binlog positions [09:38:26] so they can be applied in parallel and generate backups, do checks with replication running, etc. 
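For reference, the batched reimport described above (09:30 to 09:34) amounts to something like the sketch below: copy the table from a known-good host in primary-key ranges, deleting and reinserting one range at a time so the target stays usable and extra rows disappear along with missing or differing ones. Hosts, credentials and the throttling are illustrative; this shows the shape of the loop, not the script actually used.

```
#!/usr/bin/env python3
# Sketch of a chunked reimport of wikidatawiki.wb_terms from a known-good
# replica into the host being rebuilt. Error handling and lag checks omitted.
import time
import pymysql

BATCH = 100000
SOURCE = 'db1092.eqiad.wmnet'   # known-good source
TARGET = 'db1087.eqiad.wmnet'   # depooled host being rebuilt

src = pymysql.connect(host=SOURCE, user='root', password='********',
                      db='wikidatawiki', autocommit=True)
dst = pymysql.connect(host=TARGET, user='root', password='********',
                      db='wikidatawiki', autocommit=True)

with src.cursor() as cur:
    cur.execute("SELECT MAX(term_row_id) FROM wb_terms")
    max_id = cur.fetchone()[0]

for start in range(0, max_id + 1, BATCH):
    end = start + BATCH
    with src.cursor() as cur:
        cur.execute("SELECT * FROM wb_terms "
                    "WHERE term_row_id >= %s AND term_row_id < %s",
                    (start, end))
        rows = cur.fetchall()
    with dst.cursor() as cur:
        # Delete the whole range first, then reinsert it, so rows that should
        # not exist on the target go away along with the missing ones.
        cur.execute("DELETE FROM wb_terms "
                    "WHERE term_row_id >= %s AND term_row_id < %s",
                    (start, end))
        if rows:
            placeholders = ','.join(['%s'] * len(rows[0]))
            cur.executemany(f"INSERT INTO wb_terms VALUES ({placeholders})",
                            rows)
    time.sleep(1)   # crude throttle; the real loop would check replication lag
```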
[09:38:27] or certain tables [09:38:45] for example, you compare your backup yesterday with your server today [09:38:57] and with the binlogs you know they are consistent even different timestamp [09:39:00] many possibiltyies [09:39:36] yeah [09:39:45] wb_terms is a table that we could also store until deletion outside of the main databases [09:39:46] Very granular recovery could be possible [09:39:58] Oh btw [09:39:58] as, as far as I know, it is not used except for autocompletion [09:40:09] which should go to elastic [09:40:18] Do I need to generate an insert on the backups table for the new s8 backup that I manually did? [09:40:19] but meanwhile it should go outside of the metadata dbs [09:40:29] insert? [09:40:34] ah [09:40:37] Or update [09:40:40] so you did a manual backup [09:40:42] Yeah [09:40:45] For s8, as it failed [09:40:51] but did not specify where to announce it [09:40:57] so by default it doesn't announce it [09:41:08] So it is now finished, chown'ed and move it to latest, moved the latest to archive [09:41:22] assuming it finished and it is ok, sure [09:41:23] So I assumed I had to update backups table [09:41:33] but maybe you can run the code [09:41:36] sure [09:41:39] without doing a backup [09:41:46] it will be easier [09:41:47] if you point me to it :) [09:41:51] because it stores each file [09:41:54] individually [09:42:03] the code is on path [09:42:19] on the backup tmp servers [09:42:21] so dbstore1001 [09:42:37] but is it part of dump_sectioy.py or is it a different script? [09:42:38] copy it locally [09:42:40] morning, I forgot to say, deleteLocalPassword actually frees up lots of storage in user table. If you saw a huge decrease in size when you do a schema change there, don't be scared [09:42:42] it is the same [09:42:55] I would say to comment the "generate backup" [09:43:06] Coo, I will dig into it [09:43:07] and call the function with the name you just created [09:43:10] Amir1: thanks [09:43:15] you can insert manually, [09:43:19] but with size null [09:43:24] it will complain [09:43:32] or you can downtime the alert for 7 days [09:43:41] so if you run a backup manually, is there a way to also tell it to announce it to the DB? [09:43:45] like —update-db [09:43:59] Amir1: I will finish the s8 recovery soon [09:44:24] will ping you and anomie when I am happy about the state, we are close to be back in a good one [09:44:35] marostegui: indeed [09:44:38] see --help [09:44:41] :-) [09:44:47] marostegui: what I do [09:44:57] Is to copy the backup.cnf [09:45:09] jynus_: no rush, I just finished enwiki and I'm going alphabetically [09:45:09] and just point to the copy so it is easier than parameters [09:45:12] I think I know where you are heading to [09:45:15] yeah [09:45:22] I am not heading [09:45:24] so let it run with the defaults [09:45:25] but it has parameters [09:45:32] but just for tha local .cnf [09:45:35] --stats pr spe,tjomg [09:45:43] stats or something [09:45:46] --stats-host [09:45:49] check the help [09:46:00] so a normal dump_section —config bla bla will update it by default? 
[09:46:26] if the config has those same parameters [09:46:30] that the default one has [09:46:30] yeah [09:47:13] --statistics-host, --statistics-user, etc [09:47:17] also on the yaml [09:47:19] I think I will do that, that is way easier [09:47:25] I do that [09:47:50] but remember to sudo -u [09:47:53] yep [09:48:10] there is a class for statistics [09:48:16] you can run just that [09:48:43] DatabaseBackupStatistics [09:48:51] yeah, I am reading it [09:49:19] def __init__(self, dump_name, section, source, backup_dir, config) [09:49:27] that initialized the config from the command line [09:50:24] and the you can do stats.start() stats.gather_metrics() stats.finish() [09:50:44] so you gather metrics without dumping [09:51:01] I guess I could add that functionality, but I never thought about that being useful [09:51:13] the poing is to control the real time status [09:51:16] Yeah, not a big deal [09:52:39] I can do it, but honestly, I prefer if you do [09:52:46] to familiarice with it [09:52:47] Yeah [09:53:00] also you can say "this is not intuitive" [09:53:05] and send a patch, etc. [09:53:08] :) [09:53:29] I think this was actually the first time I did a manual backup after all the recent changes [09:53:33] so next time, I just copy the yamls and do it like that [09:53:39] The size of change_tag table in innodb buffer pool for wikidata got a five fold increase since yesterday [09:53:44] I hope that's good [09:53:45] yaml > command line parameters [09:53:50] yeah [09:54:16] Amir1: don't do heavy I_S or P_S querying on production [09:54:20] specially on large tables [09:54:28] * banyek read back, and now scratches his head as trying to get into the context [09:54:34] you are querying 1TB of data in memory [09:54:44] and that id bad for performance [09:54:57] you are prefiling, which is not free [09:55:22] also Amir1 recovery requires a lot of memory changes [09:55:33] so please don't use the current state as a normal one [09:55:47] I will help you evaluate the impact of your deployment, but not this week [09:55:50] and not on s8 [09:56:49] I know you're busy so I don't bother for this week. Let me know when we can check if things are working alright [09:56:56] next week [09:56:59] :-) [09:57:07] (The schema change to drop four index and one column is still needed) [09:58:33] marostegui: 2018-10-18T09:58:01.804147: row id 190263474/2380534342, ETA: 66m07s, 0 chunk(s) found different [09:58:45] pagelinks? [09:58:51] wb_terms [09:58:57] pagelinks finished correctly already [09:59:16] we will see if s8 master + all pooled replicas is "finished" too [09:59:17] Amir1: that schema change is still pending on codfw, once we are out of the woods we can resume it. We have lots of things queuing [09:59:23] jynus nice! [09:59:38] once that is done, I will fix labs [09:59:48] labs as in sanitarium? [09:59:51] and we can think about stopping replication [09:59:52] yes [10:00:05] I consider sanitarium part of the labs infrastructure [10:00:10] even if it is on production [10:00:15] as it is only needed for that [10:00:42] and with labs I mean wikireplicas clouddb hosts [10:00:47] yeah, one could understood labs as labsdb10XX [10:00:49] or should they be cloudb? [10:00:51] like, the hostname [10:01:17] I guess clouddb? [10:01:36] or wikireplicasdb? [10:01:37] and the role cloud::db [10:01:52] I think they (cloud team) call them wikireplicas? 
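A rough sketch of the "statistics only" run discussed above (09:48 to 09:51), i.e. registering an already-generated manual dump without producing a new one. It assumes the DatabaseBackupStatistics constructor and the start()/gather_metrics()/finish() calls quoted in the conversation; the import path, config keys and dump name below are illustrative guesses, so check dump_section.py on dbstore1001 for the real shapes, and remember to run it with sudo -u as noted above.

```
# Hypothetical driver around the class quoted above; not the real tooling.
from dump_section import DatabaseBackupStatistics  # assumed import path

config = {
    # Assumed keys, mirroring the --statistics-host/--statistics-user options
    # mentioned above; the real script reads them from its own config file.
    'statistics-host': 'STATISTICS_DB_HOST',
    'statistics-user': 'dump',
    'statistics-password': '********',
}

stats = DatabaseBackupStatistics(
    dump_name='dump.s8.2018-10-18--05-25-22',  # illustrative name of the manual run
    section='s8',
    source='SOURCE_HOST:PORT',                 # host the manual dump was taken from
    backup_dir='/srv/backups/latest',          # where the dump was moved to
    config=config,
)
stats.start()            # record the run as ongoing
stats.gather_metrics()   # walk backup_dir and store per-file sizes
stats.finish()           # mark it finished so the "size null" alert goes away
```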
[10:01:56] clouddbwikireplicaanalyticsdb.eqiad.wmnet [10:02:26] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ [10:02:33] banyek: this is good for you to read too ^ [10:02:47] sure [10:02:48] so you get an introduction of how they are and work [10:04:06] gah! I launched a full backup on s8 [10:04:18] Anyways, I will leave it running anyways [10:04:28] It won't take long [10:10:26] Thas was a really good post, clean, and meaningful good read; kudos! [10:10:47] banyek: for more context (future) [10:10:55] once analytics renew dbstore1002 [10:11:10] labsdb* 3 hosts will bhe the last 3 to keep multisource [10:11:26] but that promise will not be kept once those are renewed [10:11:43] metadata for all sections just doesn't fit in a single server [10:11:50] in 2-3 year's time [10:12:16] and some wikis like enwiki are way more queried than others, so it would be nice to have more resources for those [10:12:26] How long did it take us to finish that lab project? was it around 1.5y? [10:13:12] when you came in I was arealy working on that [10:13:21] I had setup the missing labsdb1008 [10:13:33] i remember labsdb1009 were already bought [10:13:37] but they were empty [10:13:39] no? [10:13:42] yes [10:13:48] I remember on the offsite we had a chat with labs about the future [10:13:49] but lots of work before that [10:13:52] and that was oct I think [10:13:56] labsdb broke every week [10:13:59] and I reimported enwiki [10:14:13] and it broke one month later due to STATEMENT based replication [10:14:20] after reimporting for 3 months [10:14:24] hahaha [10:14:27] pff [10:14:44] so decided to switch to ROW [10:14:48] due to filters [10:15:05] and then you came in, and we decided the master being in row [10:15:07] I wonder where labsdb1001 and 1003 are now, probably sitting on a cold ground :( [10:15:21] Or beloved labs hosts [10:15:24] they should have been returned, but I don't know if they did [10:15:34] labsdb1002 failed onece on the middle of the night [10:15:37] yeah, I know they were unracked [10:15:41] because raid0 was being used [10:15:43] I never got to meet labsdb1002 [10:15:56] they also crashed due to OOM every day [10:15:57] was it as friendly as 1001 and 1003? [10:16:14] over 50% of my work here was managing labsdb hots [10:16:20] for the first year [10:16:43] because they were so bad [10:16:49] that and do schema changes [10:17:06] omg [10:17:10] which hadn't been done in the last 6 months 1 year [10:17:33] labsdb used tokudb [10:17:39] which is ok for analytics (key value) [10:17:52] but for meadata, they crashed or gey locked eny time [10:17:59] it was horrible [10:18:00] because deifferent query plan than innodb [10:18:06] lag at all times [10:18:12] and that bug on the old db1069 with replication stuck on toku... 
[10:18:14] there was not even a way to measure lag [10:18:24] people used to check the latest revision insert [10:18:29] and check based on that [10:18:42] yes, remember for me labs includes sanitarium [11:04:22] 2018-10-18T11:03:00.550729: row id 640263474/2380534342, ETA: 192m18s, 0 chunk(s) found different [11:08:16] yay [11:10:59] I am going to have some lunchtime [11:49:15] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375 (10Urbanecm) [11:50:53] 10DBA, 10Cloud-Services: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193 (10Urbanecm) [11:51:02] 10DBA, 10Patch-For-Review: Prepare and check storage layer for id_internalwikimedia - https://phabricator.wikimedia.org/T196748 (10Urbanecm) [11:51:12] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112 (10Urbanecm) [12:20:20] ^ I'll handle those [12:21:48] thanks banyek :) [12:23:27] Ah [12:23:31] Those are already done [12:23:37] I knew gorwiki sounded familiar [12:25:33] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) Is this alert fully deployed? [12:27:33] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) I am not sure how useful is this, honestly- this alert would have not prevented the issue at all: ``` MariaDB Slave IO: pc1 OK... [12:29:05] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) >>! In T206992#4677326, @jcrespo wrote: > I am not sure how useful is this, honestly- this alert would have not prevented the... [12:32:21] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) > I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I wo... [12:33:52] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) >>! In T206992#4677332, @jcrespo wrote: >> I did check the parsercache hosts before the failover, to make sure they were all g... [12:34:57] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10jcrespo) > I added the step of checking replication a few days in advance to our DC failover checklist so we can also remember that for ne... [12:36:06] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) @Volans ^ is that something we can do on the dc switchover script? 
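The pre-switchover check being asked about in T206992/T207385 could be as small as the sketch below: before a DC failover, refuse to proceed unless every parsercache host has both replication threads running and little lag. The host list, credentials and threshold are placeholders, and in practice this would live inside the switchdc tooling rather than a standalone script.

```
#!/usr/bin/env python3
# Sketch of a "are all parsercache hosts replicating?" gate for a switchover.
import sys
import pymysql

PC_HOSTS = ['pc1004.eqiad.wmnet', 'pc2004.codfw.wmnet']   # ...and the rest
MAX_LAG = 30  # seconds, illustrative threshold

problems = []
for host in PC_HOSTS:
    conn = pymysql.connect(host=host, user='check_user', password='********',
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if status is None:
        problems.append(f"{host}: replication not configured")
        continue
    if status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
        problems.append(f"{host}: replication threads not running")
    elif (status['Seconds_Behind_Master'] is None
          or status['Seconds_Behind_Master'] > MAX_LAG):
        problems.append(f"{host}: lagging ({status['Seconds_Behind_Master']}s)")

if problems:
    sys.exit("not safe to switch over:\n" + "\n".join(problems))
print("all parsercache hosts replicating and in sync")
```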
[12:36:27] so I am back [12:38:09] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Banyek) 05Open>03Resolved @jcrespo Yes, it is deployed, I was just waiting on close [12:38:38] ains [12:38:43] sigh [12:39:27] 2018-10-18T12:38:58.974183: row id 1240263474/2380534342, ETA: 153m17s, 0 chunk(s) found different [12:45:54] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 (10Marostegui) I have created T207385 so we can follow the discussion there. [12:50:16] 10DBA, 10monitoring: Parser cache hit ration alerting - https://phabricator.wikimedia.org/T207273 (10jcrespo) Parser cache hit ratio alerting is difficult, specially on a passive DC. A better option would be a script that checks that most of the content is not expired, aka "does not contain mostly garbage". Th... [12:52:48] I have expanded on my thoughts on the 2 tickets [12:53:01] I am not saying it has to be that, just giving an expanded suggestion [12:53:26] and then you can build on that, proposing alternatives, etc [12:53:36] but I wanted a full initial proposal [12:53:37] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Banyek) [12:54:18] marostegui: one question before you are busy [12:54:47] (maybe you are already) [12:55:20] yes? [12:55:30] replication changes gtid, etc [12:55:40] this week, do we wait next week? [12:55:44] I don't want to do those on my own [12:55:50] let's do them on monday [12:55:53] +1 [12:56:04] wanted to know your toughts, I agree [12:56:11] I will focus on having labs fixed then [12:57:00] <3 [13:04:04] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Volans) My suggestion for this kind of check was not for the passive dc, but mainly the active one to make sure that the parser caches are properly used. We might have changes in mediawiki that will change the h... [13:07:26] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10jcrespo) I don't think having such alarm is bad- it is easy to setup, just setting up a prometheus one- but it may arrive too late. I think a check on switchdc would prevent issues rather than identify them afte... [13:10:03] 10DBA, 10monitoring: Parser cache hit ratio alerting - https://phabricator.wikimedia.org/T207273 (10Volans) That's exactly what I meant, we should have this check independently and adding other checks to the other part described in T207385 to prevent it. [13:10:40] volans: let's agree we agree [13:10:43] :-D [13:10:48] totally :) [13:11:26] I think the check you propose if good, but good in general [13:11:41] yep was not meant to prevent what happened [13:11:41] while the code check is more the actinable [13:11:54] there are other reasons to keep the other check [13:12:06] e.g. 
if mediawiki code breaks pc [13:12:54] and instantly all or many caches are invalid [13:13:04] as, not exactly, but somewhat happened begore [13:13:20] (less dramatically) [13:13:30] indeed [13:13:42] having bigger and more pc servers will help too [13:25:30] sometimes have nightmares during the night thinking wikidata.term_row_id is defined as an int and we are inserting maxint id columns [13:25:50] I meant wikidatawiki.wb_terms.term_row_id [13:31:42] you have really weird dreams then I have to say 🤣 [13:31:54] In my nightmares I am smoking again [13:38:10] jynus: marostegui: I know this is late to tell, but are you aware that I'll be away Mon-Tue? (It's a national holday and a 4 day long weekend here) [13:39:07] I wasn't [13:39:12] Put it in your calendar [13:56:00] We'll start working soon on db2042 BBU with p4paul, but we were talking about this earlier here, here's my quick recap: [13:56:11] - db2042 is a backup host, if no backup is running on that it's safe to power-off [13:56:11] - before powering off the binlog file/pos should be written down as it replicating from db1072 [13:56:11] - if something bad happens (corrupted fs, server doesn't boot up, etc.) db2078 should be repositioned to db1072 [13:56:18] did I missed something? [13:56:26] marostegui: ^ [13:56:39] db2042 is a backup host? [13:57:02] I guess it is a misc? [13:58:12] maybe we should change backups to db2078:3323, but not relevant right now [13:58:58] banyek: what *I* would do, stress on I [13:59:14] is stop replication, make sure the replicas is on the same binlog [13:59:27] and write down the master position [13:59:32] ok [13:59:33] but I guess that is what you meant? [13:59:51] "having enough information to repoint db2078" [13:59:52] yes, except changing the backups [14:00:00] yeah, that is not now [14:00:10] backups should not be runinng now [14:00:27] don't worry [14:02:14] 10DBA, 10SDC Engineering, 10Wikidata, 10Core Platform Team (MCR), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10daniel) [14:03:53] I checked, last backup was ran on 10.16 and was successful [14:04:56] I'll prepare the host for papaul (stopping replication, writing down master position, stopping mysql and power down the host, so he can work whenever he can) [14:05:15] thank you, banyek [14:17:55] ```CHANGE MASTER TO MASTER_HOST='db1072.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='', MASTER_LOG_FILE='db1072-bin.000510',MASTER_LOG_POS=700686527, MASTER_SSL=1;``` (if we need to reposition db2078:3323) [14:18:40] thanks, having it here is a great way to make sure it is not lost [14:18:54] and also I have it in case for some reasons you disconnected [14:21:07] Downtimed host, now shutting down MySQL [14:21:34] did you downtime also io and lag on the replica? [14:21:44] otherwise it will complain too [14:22:54] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) [14:23:25] banyek^ [14:23:28] oh, no, good that you mentioned, downtiming them [14:23:43] disaable alerts temporarilly [14:23:53] if it is soft alerting it will log anyway [14:40:11] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Papaul) a:05Papaul>03Banyek Disk replacement complete [14:42:36] Amir1: running password stuff on s3, right? 
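The pre-maintenance steps agreed above for db2042 (stop replication, make sure db2078:3323 is on the same binlog, write down the db1072 position) can be captured in one place, roughly as below. Hosts come from the conversation, credentials are placeholders, and the generated CHANGE MASTER statement is only a template of the repoint command already pasted above.

```
#!/usr/bin/env python3
# Sketch: snapshot db2042's replication state before powering it off, and
# print the repoint command for db2078:3323 in case db2042 never comes back.
import time
import pymysql


def dict_conn(host, port=3306):
    return pymysql.connect(host=host, port=port, user='root',
                           password='********', autocommit=True,
                           cursorclass=pymysql.cursors.DictCursor)


intermediate = dict_conn('db2042.codfw.wmnet')
replica = dict_conn('db2078.codfw.wmnet', port=3323)

with intermediate.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute("SHOW SLAVE STATUS")
    upstream = cur.fetchone()      # coordinates on db1072, needed for a repoint
    cur.execute("SHOW MASTER STATUS")
    local_binlog = cur.fetchone()  # db2042's own binlog position

# Wait until db2078:3323 has executed everything db2042 wrote.
while True:
    with replica.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if (status['Relay_Master_Log_File'] == local_binlog['File']
            and status['Exec_Master_Log_Pos'] >= local_binlog['Position']):
        break
    time.sleep(1)

print("db2042 was executing from db1072 at:",
      upstream['Relay_Master_Log_File'], upstream['Exec_Master_Log_Pos'])
print("If db2042 does not come back, on db2078:3323 run something like:")
print("  CHANGE MASTER TO MASTER_HOST='db1072.eqiad.wmnet', MASTER_USER='repl',")
print(f"  MASTER_LOG_FILE='{upstream['Relay_Master_Log_File']}',"
      f" MASTER_LOG_POS={upstream['Exec_Master_Log_Pos']}, MASTER_SSL=1;")
```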
[14:42:50] I saw db1095 (not part of mw) lagging, and I think is that [14:43:02] I just want to make sure I don't have to worry [14:43:21] if it's not part of mw, it should not be affected [14:43:34] do you want me to stop it to make sure? [14:43:34] It replicates it [14:43:38] no, it is ok [14:43:47] if you tell me "yes, it is running" is enough [14:44:02] it's running [14:44:04] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) 05Open>03Resolved Perfect, thank you! The logical drive is getting rebuilded: ``` Smart Array P420i in Slot 0 (Embedded) array A Logical Drive: 1 Size: 3.3 TB Fa... [14:44:05] cool [14:44:08] that was all :-D [14:44:14] :P [14:44:25] apprently incubatorwiki has a lot of users [14:44:44] but on non mw-servig traffic those tables will be cold [14:45:16] I will ack it on icinga [14:50:56] hello. do you still need/use/remember this Icinga check command: check_lonqqueries.pl ? I am wondering whether i should make it work on stretch for icinga1001 or remove it [14:51:09] right now it seems like we have the script and checkcommand but are not using it [14:51:53] this is how it looks https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468066/4/modules/icinga/files/check_longqueries.pl [14:54:44] we should remove it [14:55:28] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Marostegui) 05Resolved>03Open Leave it open until it finally gets rebuilt. They fail quite often unfortunately specially on old hosts and they need Papaul or Chris to pull the disk out and then back in [14:56:16] mutante: shouldn't we remove the one above and the one below too? [14:57:24] you can merge that, but extra clean up will be needed later [15:00:39] jynus_: thanks for the review:) appreciate it. i did not look at the other mysql related checks because my focus was on "all scripts using the Perl module Nagios::Plugin". The reason is that it was renamed to Monitoring::Plugin in stretch (Nagios threatened CPAN because trademarks or so). .so they would break on icinga1001 which i want to get into prod [15:02:35] after draining power on db2042 the BBU seems working again, but I asked papaul for checking the spare BBU we have (which were not working on dbstore2002) to test, just to know if we have a spare to count on or not. [15:04:43] mutante: didn't icinga used the mysql module [15:05:14] aka T162070 [15:05:15] T162070: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 [15:05:44] (going alphabetically, kmwiki right now) [15:05:51] it is ok [15:05:58] I just needed s3 confirmation [15:06:06] that will take a while [15:06:20] if we go to a separate section is when I will be interested [15:10:41] jynus_: we made it so that it will stop using the mysql module on stretch [15:10:51] cool, thanks [15:10:54] instead if will do require_package('mariadb-client') [15:11:07] that's all that include did anyways [15:12:56] 10DBA, 10Operations, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) icinga will stop using the mysql module once on stretch. Once T202782 is resolved and einsteinium isn't the prod Icinga server... [15:20:48] Amir1: your script hitting s3 again I guess? 
[15:21:24] Probably the same issue with s1 on Monday, going fast on s1 and s3 codfw master struggling because of the lack of SSDs [15:21:37] yup [15:22:29] codfw struggles (normally) because extra latency and no wait there [15:22:43] which makes me doubt how that will work with cross-dc [15:22:51] and the lack of DCs [15:22:53] SSDs [15:23:29] we should leave the old masters pooled with weight 1 once they are replaced, so we can throttle the scripts :p [15:24:50] this is a one time thing (at least the script) [15:25:06] but I can see the concern for later as well [15:25:25] Yeah, so far we are not active-active [15:25:32] but when we are….we need to see how to handle those [15:29:10] Amir1: will you script run over night? [15:29:27] Just asking to see if I have to give our US folks a heads up about possible delays on codfw [15:39:57] marostegui: very likely [15:40:23] That BBU is not working, so we don't have a spare one [15:40:41] :_( [15:40:44] At least it is confirmed [15:40:48] Amir1: cheers [15:41:14] banyek: then I would say let's then get the server back and then we can plan for a DC failover sometime [15:41:49] actually after power drain the original BBU seems working as in dbstore2002 happened too [15:41:56] ha [15:42:10] the majesty of power it off and on! [15:43:23] I will leave db1095 downtime till tomorrow [15:43:46] heh [15:43:49] ok, cool [15:57:29] db2042 is back in action, it replicates well, and db2078:3323 too. The BBU seems working, I mark the task resolved [15:57:56] nothing helps as much as a reboot [15:59:48] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) 05stalled>03Resolved a:03Banyek @Papaul did power drain that fixed the battery status. We tried our spare battery in this host as well (T205257) but it doesn... [16:01:20] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) if it fails again, I suggest we go for a DC failover. [16:01:42] anyways, I go now for cultural coordination [16:03:54] I have officially finished the s8 fixes now [16:04:02] <3 [16:04:09] I will need some extra checks, however to be 100% sure [16:04:20] Why the DC failover then? It is not the 1072 one which is used now? [16:04:21] let me know how I can help [16:04:33] Jynus_: omg, KUDOS [16:04:42] banyek: that is why I said: DC failover and not primary master failover :) [16:05:07] marostegui: I am not going to touch s8 master anymore [16:05:26] it would be nice to have some compare.py or something for a double check [16:05:31] yeah [16:05:36] is also labs done too? [16:05:38] I will focus tomorrow on labs [16:05:41] ^ [16:05:44] ah cool [16:05:55] probably fixed by weekend [16:06:18] we will see [16:06:41] * marostegui updates timelines [16:06:47] I was mostly worried about the master because it could cause a replication break [16:07:01] yeah it would be a good cascade [16:09:21] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) Update from Jaime 18th Oct 16:05: s8 core hosts all finished... 
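For the "some compare.py or something for a double check" mentioned above, that chunked comparison boils down to the sketch below: fingerprint the table in primary-key ranges on two hosts and only flag ranges whose fingerprints differ. It is written against wikidatawiki.wb_terms with a subset of its columns; hostnames are examples and the CRC32 sum is a cheap, collision-prone stand-in for a proper checksum, so treat it as an outline rather than the real compare.py.

```
#!/usr/bin/env python3
# Sketch: chunked checksum comparison of one table between two hosts.
import pymysql

CHUNK = 1000000
TABLE = 'wb_terms'
PK = 'term_row_id'
# Cheap per-chunk fingerprint: row count plus a sum of per-row CRC32s
# over a subset of columns.
CHECKSUM_SQL = (
    f"SELECT COUNT(*), COALESCE(SUM(CRC32(CONCAT_WS('#', "
    f"term_row_id, term_entity_id, term_language, term_type, term_text))), 0) "
    f"FROM {TABLE} WHERE {PK} >= %s AND {PK} < %s"
)


def connect(host):
    return pymysql.connect(host=host, user='root', password='********',
                           db='wikidatawiki', autocommit=True)


a = connect('db1071.eqiad.wmnet')   # e.g. the fixed master
b = connect('db1092.eqiad.wmnet')   # e.g. a known-good replica

with a.cursor() as cur:
    cur.execute(f"SELECT MAX({PK}) FROM {TABLE}")
    max_id = cur.fetchone()[0]

different = []
for start in range(0, max_id + 1, CHUNK):
    fingerprints = []
    for conn in (a, b):
        with conn.cursor() as cur:
            cur.execute(CHECKSUM_SQL, (start, start + CHUNK))
            fingerprints.append(cur.fetchone())
    if fingerprints[0] != fingerprints[1]:
        different.append((start, start + CHUNK))
    print(f"row id {min(start + CHUNK, max_id)}/{max_id}, "
          f"{len(different)} chunk(s) found different")
```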
[16:09:38] 10DBA, 10Lexicographical data, 10Wikidata, 10Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (10Marostegui) [16:09:49] db2042 caught up. [20:25:56] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Volans) db2042 failed to start `ferm` at reboot due to a DNS timeout query: ``` Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query t... [20:30:42] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Volans) Opened T207417 for the ferm part. [21:07:00] 10DBA, 10JADE, 10Operations, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 4 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui We've merged the DDL to our repo in order to unblock development, so here ar...