[00:30:58] 10DBA, 10Wikidata, 10Performance, 10User-Daniel: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#3389322 (10Ladsgroup) [00:31:01] 10DBA, 10Wikidata, 10Patch-For-Review, 10Performance, and 3 others: Consider only updating wb_changes_dispatch after a successful run - https://phabricator.wikimedia.org/T162556#3389321 (10Ladsgroup) 05Open>03Resolved [00:31:04] 10DBA, 10Wikidata, 10Performance, 10User-Daniel: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2824480 (10Ladsgroup) [00:31:07] 10DBA, 10Wikidata, 10Patch-For-Review, 10Performance, and 3 others: Use replica for reading the last dispatch position (chd_seen) - https://phabricator.wikimedia.org/T162557#3389323 (10Ladsgroup) 05Open>03Resolved [00:50:49] 10DBA, 10Wikidata, 10Performance, 10User-Daniel: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#3389346 (10Ladsgroup) With the deployment of the changes in the dispatching, in the last 24 hours we had around 1,800 cases of going readonly but thi... [05:25:37] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208#3389527 (10Marostegui) [06:56:28] what work or issue is happening on db1047? [06:57:31] I see/know the alter tables, but that would not explain the monitoring problems? [06:58:13] jynus: Only the alter tables - I haven't checked a lot, but maybe overload and the checks timing out? [06:58:29] why ack them, then? 
[06:59:13] it is ok to leave them off and anyone else can have a look [06:59:16] Because there is work going on and the threads are running fine, so I thought I would downtime them until the work is complete, as they might be flapping [06:59:24] it could be the changes I did yesterday [07:00:31] it is using the wrong socket [07:01:28] did it get restarted or what? [07:01:34] no [07:01:38] but for some reason [07:01:51] /etc says the socket is in one place [07:01:59] but icinga searches for it on the wrong one [07:02:17] however, the alarm didn't happen yesterday [07:02:28] at least not hours after I reenabled puppet [07:04:15] Maybe it was already downtimed when that happened and it just came back from downtime [07:04:59] could be [07:05:03] I will try to fix it [07:05:37] it is ok to leave things unacked, if you do not know why it is happening/have other things to do [07:05:50] we are a team :-D [07:06:04] :) [07:06:14] or ack and create a ticket [07:06:38] normally I ask you first in case you know [07:06:48] like here [07:06:55] Nah, this time I assumed it was the ongoing alters [07:07:00] ok ok [07:07:03] fair enough [07:07:16] (as I have seen it before, especially with the huge alter table I tried for 13 days (!!) on the revision table) [07:08:25] I see, it is a very simple mistake [07:08:34] just the default socket location has changed [07:10:21] I created this [07:10:49] but it must be as you said- I didn't realize because of the downtime or something [07:11:38] yeah, maybe it was downtimed when the alters started [07:47:20] x1 is still replicating from codfw [07:47:24] but it is on row C [07:47:50] es1011 has es2016 as master [07:48:05] but on row D [07:48:18] yeah, only row D is supposed to be affected [07:48:30] I thought it was A? 
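[editor's note: the socket mismatch above can be demonstrated with a minimal sketch. The config file, socket paths, and the check's expected path are hypothetical examples, not the actual db1047 values.]

```shell
# Sketch: the configured MariaDB socket vs. the path a monitoring check
# probes. All paths here are example values created in a temp file.
cnf=$(mktemp)
printf '[mysqld]\nsocket = /run/mysqld/mysqld.sock\n' > "$cnf"

# Parse the socket path the server is actually configured with.
configured=$(awk -F' *= *' '/^socket/ {print $2}' "$cnf")

# The (stale) default path a misconfigured check might still look at.
check_path='/tmp/mysql.sock'

if [ "$configured" != "$check_path" ]; then
    echo "socket mismatch: config=$configured check=$check_path"
fi
rm -f "$cnf"
```

If the two paths differ, the check fails even though mysqld is healthy, which matches the "very simple mistake" diagnosed above.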
[07:48:34] These are the hosts I found that might page: https://phabricator.wikimedia.org/T168462#3366437 [07:48:37] sorry, A [07:48:46] D was the one done last week [07:49:26] es2018 is on row B [07:49:30] so also unaffected [07:49:43] (I was only checking indirect replication pages) [07:50:03] yes yes [07:50:07] I don't see those being affected [07:50:28] s4 is disconnected [07:50:33] no, and even on: https://phabricator.wikimedia.org/T168462#3366437 db1068 isn't affected because replication is disconnected [07:50:36] and the other too [07:50:36] yeah [07:50:44] plus those don't page [07:50:55] misc and codfw don't page for replication [07:51:06] because users don't get affected [07:51:11] I would downtime them, however [07:51:39] all hosts and services for a couple of hours- maybe mysql process and disk space pages [07:51:59] what do you think? [07:52:04] why? [07:52:12] to avoid paging [07:52:22] or at least the ones that page [07:52:43] but why would mysqld process and disk space page for this maintenance? (they didn't page for the row D one) [07:52:50] ok, then [07:53:07] not saying we shouldn't, just trying to understand your thinking process [07:53:10] I thought they might, but it makes sense they will not [07:53:10] :) [07:53:26] especially if you tell me they didn't the other time [07:53:39] no, last time only es in eqiad paged [07:53:46] I do not trust the service dependency is well configured [07:53:47] for replication broken [07:53:54] but from what you say it is [07:54:10] I think it is because I am accustomed [07:54:24] to when a service goes down, it pages when it comes back [07:54:33] but only because mysql is not working [07:54:40] it is a new host [07:55:38] ah yeah [07:55:38] there is one extra thing we should check [07:55:50] do you know if some es host is affected? 
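[editor's note: "downtime all hosts and services for a couple of hours" maps to Icinga's external command interface. A minimal sketch, assuming the standard SCHEDULE_HOST_SVC_DOWNTIME external command; the host name, author, and command-file path are examples and vary per installation.]

```shell
# Sketch: build a fixed two-hour downtime for every service on one host,
# in Icinga/Nagios external-command format:
# [time] SCHEDULE_HOST_SVC_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
host='es2016'                              # example host from the discussion
start=$(date +%s)
end=$((start + 2 * 3600))                  # the "couple of hours" window
cmdfile='/var/lib/icinga/rw/icinga.cmd'    # command-pipe path varies per setup

cmd="[$start] SCHEDULE_HOST_SVC_DOWNTIME;$host;$start;$end;1;0;7200;dba;row A switch maintenance"
echo "$cmd"
# echo "$cmd" > "$cmdfile"   # uncomment only on the Icinga server itself
```

Repeating this per affected host (plus SCHEDULE_HOST_DOWNTIME for the host checks) suppresses the pages without acking individual alerts.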
[07:56:15] there are no es masters on rown A [07:56:17] row [07:56:19] slaves [07:56:46] yes [07:56:47] there were some [07:56:49] let me check [07:56:54] es2014 [07:56:57] yes [07:57:03] we should check query patterns [07:57:22] es2017 [07:57:25] because last time we had network issues, they didn't fail right away [07:57:39] and the thing is- we may need app changes [07:57:42] those two [07:57:46] so that they timeout faster [07:57:52] or they may create an outage [07:58:04] this is not part of the downtime, this is a check I would like to do [07:58:15] Sure, I wasn't aware of those issues [07:58:27] so the idea is that if the host is down [07:58:42] they are identified quickly and depooled [07:58:45] but we suspect [07:58:52] (don't know for sure) [07:59:03] that if packets are lost (like here) [07:59:30] the timeout is so large that it could cause issues due to not being quickly depooled [07:59:44] ah right I see [07:59:55] last time the packet loss was really small [07:59:57] this would be a good time to check [07:59:59] they were down for seconds [08:00:01] only [08:00:10] oh, so maybe not enough [08:00:18] let's see this time :) [08:36:42] https://gerrit.wikimedia.org/r/#/c/362152/ [08:37:07] oh good [08:37:13] :) [08:37:18] remember to deploy after stop [08:37:24] yep :) [08:37:26] or otherwise you will not be able to [08:37:54] good test, btw [08:38:01] I only tested on new hosts [08:47:33] so, did it work? [08:49:22] just came back from the reboot [08:49:26] so starting mysql now :) [08:50:09] worked like a charm :) [08:54:54] icinga, grafana, init.d worked? [08:56:03] looks so :) [09:00:29] I have created https://gerrit.wikimedia.org/r/362156 but I am not in a rush to deploy [09:05:52] dbproxies complain because some m service has cross-dc availability, right? 
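[editor's note: the "timeout so large that the host is not quickly depooled" concern can be illustrated with a short client-side probe. This is a generic sketch, not the MediaWiki depooling logic: the non-routable address simulates silent packet loss, and `timeout` bounds the wait the way a client-side connect timeout (e.g. the mysql client's `--connect-timeout`) would.]

```shell
# Sketch: bound how long a TCP probe may hang against an unreachable host.
# 10.255.255.1 is a non-routable example address simulating dropped packets;
# without the 3s bound, the connect would hang for the OS TCP default.
host='10.255.255.1'
port=3306
status='reachable'

timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || status='unreachable'
echo "probe result: $status"
```

A host that is hard-down fails fast (connection refused), but one dropping packets only fails once a timeout like this fires, which is why the app-side timeouts mentioned above matter for quick depooling.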
[09:06:43] I believe so [09:06:46] I am checking [09:07:35] yeah, ie: db2011 [09:07:54] which is in row a [09:08:49] ok [09:09:05] we should probably change that on a multi-instance case [10:00:43] hey guys [10:00:49] can we delay the meeting a little bit while we look at the network issues? [10:01:21] fine by me [10:01:23] minutes? an hour? [10:01:32] probably minutes, but no guarantees :) [10:01:53] if you want to take a break or something we can postpone for a bit longer indeed [10:02:17] up to you- if minutes we just hold it, or we can reschedule it [10:02:21] 12:30 ? [10:02:24] we should do it today though [10:02:33] let's try for 12:30, if we don't make that, we'll do it much later today? [10:02:37] ok [10:02:38] sure [10:02:41] cool [10:08:51] 10DBA, 10Labs, 10User-Urbanecm: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193#3390099 (10Urbanecm) [10:10:31] 10DBA, 10Labs, 10User-Urbanecm: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193#3390114 (10Marostegui) As the database isn't created in production yet, please let us know when it is done, so we can sanitize it on sanitarium hosts and labs. And create the views. ``` ro... [12:19:20] 10DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3390515 (10Marostegui) I have reimported `jawiki.watchlist` from production into dbstore1001, removed the ignore replication filters, enable the normal ones and restarted replication on s6.... 
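[editor's note: the T169050 fix quoted above (reimport `jawiki.watchlist` into dbstore1001, adjust filters, restart replication) can be sketched as a command flow. The source host, credentials, and filter handling are placeholders; only the database, table, and destination come from the ticket.]

```shell
# Sketch of a single-table reimport, per the T169050 comment above.
src='db1061'          # hypothetical production s6 replica to dump from
dst='dbstore1001'     # destination from the ticket
db='jawiki'
table='watchlist'

# --single-transaction gives a consistent InnoDB snapshot without locking.
dump_cmd="mysqldump -h $src --single-transaction $db $table"
load_cmd="mysql -h $dst $db"
echo "$dump_cmd | $load_cmd"

# Afterwards, on dbstore1001 (multi-source syntax, filters elided):
#   STOP SLAVE 's6';
#   -- drop the temporary ignore filters, restore the normal ones
#   START SLAVE 's6';
```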
[13:05:02] 10Blocked-on-schema-change, 10DBA: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3390636 (10Marostegui) [13:42:20] I am going to restart db2072 mariadb several times for systemd tuning [13:43:41] cool [14:06:35] 10DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3390827 (10Marostegui) 05Open>03Resolved a:03Marostegui I have fixed x1 as well - will close this for now as there is not much we can do. At least the two shards that broke are now fixed. [14:14:40] 10DBA, 10Patch-For-Review: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354#3390871 (10Marostegui) An update: The shards keep catching up (as they were really delayed for the maintenance that pretty much lasted a week). x1, s3 and s6 already caught up. T... [16:16:00] jynus: re db104[67] replacements - yes we are ok with standard db hosts, ~4TB in raid10 is good enough for our use case.. If this is ok for you I'd ask Rob to proceed with quotes etc.. [16:16:27] the thing is [16:16:34] this was supposed to be temporary [16:16:42] and at some point not needed at all [16:16:51] so if we buy a standard config [16:17:03] we can use it for other analytics services, too [16:17:09] sure sure [16:17:13] I am not saying this to take them over [16:17:30] also, disk size is standard- fewer problems with replacements, etc. [16:17:47] super [16:18:00] for example, we could setup a second dbstore for you and research in the future [16:18:05] etc. 
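[editor's note: "restart db2072 mariadb several times for systemd tuning" is a restart-and-verify loop. A minimal sketch; the unit name, iteration count, and the commented `systemctl` calls are assumptions about how such a test would be run on the host itself.]

```shell
# Sketch: restart the mariadb unit a few times and verify it comes back
# each time, as one would while iterating on systemd unit settings.
unit='mariadb'
for i in 1 2 3; do
    echo "restart #$i of $unit"
    # systemctl restart "$unit"                  # run on db2072 itself
    # systemctl is-active --quiet "$unit" || break
done
```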
[16:18:21] ah yes if in the future EL data will be entirely on HDFS, right [16:18:26] good point [16:18:35] or any other thing you need [16:18:52] or if you do not need mysqls, we take them and change them for other hardware [16:19:03] with more disk or whatever [16:19:08] yep yep [16:19:31] all right sounds like we are on the same page, will proceed with quotes :) [16:19:36] what I would be against is the usage of HDDs [16:19:41] for any storage [16:19:44] that is live [16:19:53] your alters would be 10x faster [16:19:59] :D [16:20:09] it is ok for dead/static data [18:31:56] 10DBA, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 3 others: archive table needs index starting with timestamp - https://phabricator.wikimedia.org/T164975#3392124 (10Smalyshev) p:05Triage>03Normal @Jcrespo just to be sure, creating test hosts is something you can do? Want to ensure this doesn'...