[00:36:04] 10DBA, 10Phabricator, 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3965756 (10mmodell) [00:36:53] 10DBA, 10Phabricator, 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3965768 (10mmodell) p:05Triage>03Normal [01:16:22] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3965805 (10Krinkle) [06:33:29] 10DBA, 10Phabricator, 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3966122 (10Marostegui) Midnight UTC would be impossible for me Thursday UTC morning I could do it (I could be around 6AM UTC or so). Would that work? About th... [06:37:05] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966131 (10Marostegui) [06:58:42] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966165 (10Marostegui) [07:31:17] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966210 (10Marostegui) So, just to confirm what I was suspecting yesterday (but pending to confirm today). To be able to entirel... [07:53:49] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966227 (10Marostegui) And these are how the tables look like: ``` for i in archive image ipblocks oldimage revision image; do e... [07:54:32] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966228 (10Marostegui) Progress of s5 (for this initial alter I am doing it on codfw host by host) once we have seen no replicat... [07:54:59] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3966229 (10Marostegui) [08:02:05] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3966232 (10Marostegui) I have finished the text table. Next: oldimage [08:51:03] 10DBA, 10Phabricator, 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3966303 (10jcrespo) Can you share the migration script in advance? [09:07:11] ready for 409008 ? [09:07:20] sure, give me 1 sec [09:07:33] 1 second gone, starting [09:07:38] just kidding [09:07:41] actually….go! [09:07:44] I can give you 2 seconds [09:07:44] I am ready :) [09:07:55] or even 3, I am generous today [09:08:22] taking 5 minutes myself [09:08:25] haha [09:20:10] I am going to start disabling puppet on all eqiad dbs [09:20:17] ok [09:29:16] cumin 'es1*' 'disable-puppet...' 
is getting blocked [09:29:53] if puppet is running it waits for it to finish [09:30:08] but it has a timeout by default [09:30:09] which is ok, but it is taking too long [09:30:25] normally it doesn't take more than 30 seconds [09:30:53] is there a way to see which one is blocked before it fails? [09:31:09] I can take a look if you want, from neodymium or sarin? [09:31:13] n [09:31:31] if you are going to do ps, I can do that, too [09:31:51] check open ssh connections or from the logs, but if you didn't run with the debug option I think it is not logged [09:31:57] netstat [09:32:00] ;) [09:32:18] es1019 [09:33:21] 11.1% (1/9) of nodes failed to execute command 'disable-puppet "..."': es1019.eqiad.wmnet [09:33:49] now it worked [09:34:12] so either cumin or that script has some race condition [09:34:12] how much time does a puppet run take on that host? [09:34:23] I can check [09:35:05] a lot [09:35:15] seems like a problem with communication with the puppet master then [09:35:27] blocked on "Info: Loading facts" [09:35:33] which normally is very fast [09:35:58] was herron working on puppetmaster? [09:36:59] in general yes, but I don't think anything that could be related in the last few days [09:38:14] the default timeout of disable-puppet is 300 seconds (30 loops with a 10s sleep), adjustable by the second parameter [10:00:38] should we depool one multiinstance host when deploying to eqiad? [10:02:38] yeah not a bad idea [10:03:23] weren't you deploying changes to one? [10:03:58] also, are you working with db2059 right now, or is the lag me? [10:04:34] no [10:04:35] it is me [10:04:43] db2059 and db2038 [10:04:45] good [10:04:56] that means I have not broken anything yet :-) [10:05:52] the change looks ok everywhere on codfw, let me check logstash [10:06:14] I was checking it and didn't see anything wrong [10:06:57] oh, so you had repooled db1105:3311 already [10:07:06] yeah, a bit ago :( [10:07:16] db1089 can I enable puppet? [10:07:21] go for it [10:08:44] noop, as expected [10:08:48] :( [10:08:49] :) [10:08:52] the only ones I worry about are the multiinstance ones [10:09:02] do you want to try sanitarium now? [10:09:07] sure [10:09:20] you do it or I do it? :) [10:09:28] whatever you prefer [10:09:32] I am there already [10:09:34] I will do it [10:09:43] db1095 should be a noop [10:09:52] the other is the one that will change ferm rules [10:10:00] I am trying db1102 [10:10:03] that one [10:10:04] which is multi instance [10:10:26] there is of course also the possibility of a bug on that specific role, etc. [10:10:53] but that would be a puppet failure, not a big deal, I am only worried about firewall changes [10:11:06] finished [10:11:23] did you see the network going down? [10:11:28] nope [10:11:31] or something strange? [10:11:33] no pings were lost from neodymium [10:11:48] and what does iptables -L say? [10:11:55] regarding mysql, of course [10:12:06] looking good :) [10:12:37] I assume replication is flowing, etc.? [10:12:39] yeah [10:12:44] I was on labsdb1010 and nothing broke [10:12:50] cool [10:13:05] maybe I am getting more conservative with the age [10:13:13] *conservative about changes [10:13:26] No, I totally agree with this procedure, especially when changing network-related things :) [10:13:29] so I will do a last test [10:13:31] so +1 to be conservative [10:13:39] depooling a multiinstance replica [10:13:48] and then enable puppet everywhere [10:14:21] is anyone on enwiki useful to you or should I do any at random?
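(For reference on the disable-puppet timeout mentioned above — 30 loops with a 10s sleep, adjustable via the second parameter — here is a minimal sketch of what such a wait-then-disable wrapper could look like. This is an illustrative reconstruction, not the actual script on the WMF hosts; the lock-file path and the plain `puppet agent --disable` call are assumptions.)

```bash
#!/bin/bash
# Illustrative sketch only: disable puppet once any in-flight agent run finishes.
# Usage: disable-puppet "reason" [timeout_seconds]
reason="${1:?a reason is required}"
timeout="${2:-300}"          # default 300s, i.e. 30 loops of 10s as discussed above
lockfile="/var/lib/puppet/state/agent_catalog_run.lock"   # assumed path, varies per setup

waited=0
while [ -e "$lockfile" ]; do
    if [ "$waited" -ge "$timeout" ]; then
        echo "puppet still running after ${timeout}s, giving up" >&2
        exit 1               # cumin would then report this host as failed, as with es1019
    fi
    sleep 10
    waited=$((waited + 10))
done

puppet agent --disable "$reason"
```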
[10:14:22] sounds good [10:14:34] nah, take any random :) [10:19:54] https://gerrit.wikimedia.org/r/#/c/410135/ [10:29:43] I can confirm ferm changes seem "transactional" [10:29:50] even if rules are deleted and added [10:30:07] so enabling puppet everywhere [10:30:20] well, I will repool db1099 first [11:50:29] I am going to start the manual backups now [11:50:37] Nice! :) [12:04:51] 10DBA, 10Phabricator (2018-02-xx), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3966756 (10Aklapper) [13:34:44] 10DBA, 10Commons, 10MediaWiki-Watchlist, 10Wikidata, and 4 others: Re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#3967130 (10WMDE-leszek) [13:53:25] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#2871964 (10Bawolff) Sillu question - isn't the new index pointless? Its adding el_id on the end, but isn't that always on the end of every index,... [14:17:57] hello people, if you have time can we chat about https://phabricator.wikimedia.org/T159423#3955466 ? [14:18:16] just to be sure if we need to order hw or not :) [14:19:18] elukey: hi! [14:19:37] I think 1) yes 2) probably not before Q2 I would say [14:20:30] we can order now, that is not a problem [14:20:41] setting them up is a different issue [14:26:06] oh yes definitely [14:26:27] my questions were only to have an idea about steps to take now [14:31:00] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#3967501 (10Anomie) > I'm going to guess this has to do with doing order by on a partial index being weird. That's right. I wrote a long explanat... [14:31:39] marostegui: do you have a rough idea about what/how-many db host we need to order to replace dbstore1002? [14:31:54] N+1 :-P [14:33:19] elukey: That depends on the HW specification and the plans you guys have for dbstore1002 :-). Are they going to store only core sections (s1-s8?) [14:34:04] Or just replace what we have now with multiinstance and "that's it" [14:34:05] ? 
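(Going back to the ferm/iptables verification discussed earlier in the log, this is roughly the kind of manual check that was being described after enabling puppet on a multi-instance host. It is a sketch only: the host, the ports and the credentials handling are placeholders, not the exact commands that were run.)

```bash
#!/bin/bash
# Illustrative post-puppet checks on a multi-instance replica (placeholders only).
host="db1102.eqiad.wmnet"

# Did the ferm-managed rules for the mysqld ports survive the puppet run?
ssh "$host" 'sudo iptables -L -n | grep -E "33(06|1[0-9])"'

# Is replication still flowing on each instance? (assumes credentials in ~/.my.cnf)
for port in 3311 3312; do
    mysql -h "$host" -P "$port" -e 'SHOW SLAVE STATUS\G' \
        | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
done
```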
? [14:34:59] the other question is if they would be open to share infrastructure with m4 [14:35:42] there is quite some available disk space there, and it could help with extra redundancy and better utilization for passive sections [14:36:26] (basically, starting with the minimal purchase needed, so that there are no inefficiencies, while providing redundancy) [14:36:28] so our long term plan is to have everything in hdfs/hadoop and skip mysql, so I'd say that we'd need to replace the current dbstore1002 [14:36:53] but we still don't have a precise idea about the how/when [14:37:00] marostegui: if you understood what I meant, maybe you can translate for me :-D [14:37:27] jynus: I completely got it :) I think that we can keep db110[78] only for m4 [14:37:56] I am proposing not to do that [14:41:33] sure, and I wrote my thought, but I got what you were saying [14:42:52] then it is me who is not understanding [14:45:24] atm we are pushing eventlogging's data (only recent one though) to HDFS via the json refine spark job that Andrew wrote, plus we have two databases with historical (and purged/sanitized) data [14:46:17] eventually it would also be awesome to return db110[78] to you guys, but not sure if we'll be able to do it in the next FY [14:46:24] no no [14:46:37] this is not a borrowing [14:46:55] you bought them, they are yours [14:47:15] if you say "we do not need them anymore" that is ok, they will go to spare [14:47:27] this is what I meant yes [14:47:29] but that has nothing to do with what I am saying or you are asking [14:47:55] what I am saying is that because of that, it is not worth buying 20 machines [14:48:23] and we could share some resources to not buy more than the ones necessary [14:49:01] for example, if wiki A is going to be queryable on db1180 [14:49:21] we can have a copy on db1107, which will only be there just in case [14:49:51] but not really taking much iops, just space, as there seems to be plenty there [14:50:44] or we could continue using dbstore1002 for a while more, in addition to the new purchases [14:50:59] and have redundancy there, but not in active/preferred use [14:51:44] does that make sense? [14:51:46] but it will mean more maintenance burden for you guys [14:52:07] no, the plan is to delete dbstore1002 [14:52:08] if this is fine keeping both would be optimal [14:52:23] the problem is keeping it with the current structure [14:52:49] I do not know exactly when it is planned for decommission, we should check that [14:53:39] elukey: to give you an idea, we currently keep a dbstore2002 [14:53:49] suggested replacement for dbstore1002 is 2019-02-21 [14:54:32] ok, then let's plan replacing it and increasing the resources, I do not see the problem here [14:58:54] the only thing that I'd like to know is what to put in the budget request :) [14:59:07] how many hosts and more or less how much they will cost [15:03:20] So, right now dbstore1002 is using 4.6T [15:04:37] @meeting [15:04:55] elukey: people already assumed they won't be able to do joins right? [15:05:10] if that is the case, I would prefer to go for two hosts rather than one [15:06:18] what we have now in dallas is: two dbstores and each of them has some sections, if one goes down, we lose 50% (more or less) [15:07:23] marostegui: I already had a chat with the Research people but I need to talk with the rest of the data analysts.
[15:10:04] elukey: if joins are a _must_ then we cannot move away from the dbstore1002 model (that is one mysql process replicating all the instances - we call that multisource) until it is migrated to hadoop, which would be a bad thing as we are trying to get rid of the multi-source model and go for multiinstance (several independent mysql processes on each server) [15:12:31] nah let's stick with the standard, I need to inform people that joins will not be possible anymore [15:15:06] Cool [15:27:42] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3967672 (10Lydia_Pintscher) Is there anything left to do here or can it be closed? [15:51:53] marostegui: so two hosts? Do we have a rough estimate of the cost? [15:52:57] elukey: That was a suggestion, we can go for two smaller hosts or just a big one like dbstore1002 [15:53:41] elukey: Right now with a standard 4T host, that would not be enough to hold 8 shards for long [15:54:01] (assuming we are going for innodb and not toku) [15:54:39] marostegui: super [15:55:02] As I said, dbstore1002 is using 4.6T, but it is using tokudb of course [15:56:34] so, if we move to innodb, two hosts (standard 4T) wouldn't be enough to keep all the sections (we are experiencing that already on our sanitarium hosts) [16:02:45] I was thinking of buying 3 servers + eventlogging hosts for support [16:03:08] so we'd have redundancy [16:03:16] yeah, 3 servers sounds good [16:03:22] 3 standard ones? [16:03:32] but we need more space for full redundancy [16:03:46] unless you want to mix them with the provisioning servers [16:04:12] so there is the data redundancy and the service redundancy [16:04:17] That would be up to elukey if they really want to have full redundancy until dbstore1002 is completely gone or not [16:04:33] that was my initial point [16:04:45] I think the min amount of servers to keep the service is 3 (standard ones) [16:04:56] if one goes down, we are losing 30% (more or less) of the service [16:05:06] which is still better than what we have now (100% if dbstore1002 goes down) :) [16:05:22] but if this was a proper server and nothing else existed, not sure it would be enough [16:05:25] *service [16:05:51] note sanitarium works well because it doesn't serve queries [16:06:03] dbstore1002 has heavy querying [16:06:06] elukey: how long do you think till dbstore1002 is replaced with hadoop? [16:06:25] plus the overhead on us should be justified [16:06:50] e.g. "let's buy 10 servers, we don't know if they will be used, but we will just do it" [16:06:58] What I would like to avoid is buying NOT standard servers [16:06:59] it would be a no go [16:07:05] yeah, that +1 [16:07:25] so between not buying anything, and buying 10 servers that will be unused [16:07:32] there is a middle point [16:07:58] and I was adding the possibility of combining with other similarly architected services [16:08:10] such as eventlogging or provisioning [16:08:23] not to fund those, they will be separately funded [16:08:33] but to avoid having spare capacity or servers in the future [16:08:51] e.g. if a server is idle "just in case" [16:09:01] we can reuse it for another non-critical service [16:09:27] yeah, so let's be clear so elukey can take a decision [16:09:30] e.g.
labsdb1010 is currently idle, but it serves as a failover for analytics and web, and in the future as a second analytics [16:09:42] 3 standard servers for the replacement of dbstore1002 is the min we'd need [16:10:43] maybe 2 if combined with dbstore1002 while it is still up and/or m4 servers ? [16:10:54] and buy a 3rd one later [16:11:06] marostegui: sorry in meetings, will lag a bit in answering. We don't have a precise timeline but I'd say during the next two years for sure [16:11:33] on the other side, buying larger batches tends to be less pricey [16:12:05] e.g. if combined with the eqiad refresh this Q or the next FY [16:12:52] I still think we should go for 3, as dbstore1002 can fail anytime I would say [16:14:36] +1 [16:14:39] but look at things like the current usage of the m4 replica https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1108&var-port=9104 [16:14:43] it is very low [16:15:36] 23 IOPS on average? [16:15:50] let's look at dbstore1002 [16:16:40] yeah - m4 is not used [16:16:52] 4K in writes- probably mostly replication [16:17:03] but top 1.5K reads [16:17:24] note I am not saying "let's not buy stuff" [16:17:25] I still prefer to isolate things and not touch m4, but not super strong against it. Not sure if 1 extra host (from 2 to 3) is worth the time we'd need to invest in changing m4 to multi-instance [16:17:29] I am saying [16:17:40] let's make a wise use of resources [16:18:04] note eventlogging used to be on dbstore1002 on top of the replicas [16:18:12] yeah, I know [16:18:32] I would say to elukey, ask what you need, you set the timelines [16:18:44] put that on a ticket [16:18:51] you do not know the timelines? [16:18:57] then make them up [16:19:09] Yeah, my point is: is it worth investing time on moving m4 to multi-instance if dbstore1002 will be gone, or is it easier to buy one extra host and not go into that [16:19:12] and we will discuss what is best for you with a proposal [16:19:39] that is kind of our problem, elukey doesn't need to know [16:19:48] he should state clearly the needs [16:19:59] replace or provide redundancy to dbstore1002 [16:20:07] lifespan [16:20:28] then we can give him a proposal, it is not worth discussing the details without the big picture [16:20:38] and then we can iterate [16:20:39] I know it is our problem, not saying elukey has to decide if we need to use m4 or not. Just giving my point of view about why I would prefer 3 hosts instead of 2+m4 [16:20:53] 2+ m4 is still 3 hosts [16:21:13] ok: 3 new hosts instead of 2 new hosts +m4 [16:21:19] I don't want to discuss that [16:21:25] ok [16:21:28] until elukey gives us [16:21:43] lifespan of dbstore1002 [16:21:46] You started all this discussion I think - I was just giving my point of view :) [16:21:49] lifespan of eventlogging [16:21:57] ok, my fault [16:22:07] no one's fault!
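(To make the multisource vs. multiinstance distinction above concrete: on a multi-instance replacement each section runs as its own mysqld on its own port, so the cross-section joins that dbstore1002 allows stop working. A sketch only, assuming the per-section port convention suggested by the db1105:3311 notation earlier in the log and a hypothetical host name:)

```bash
#!/bin/bash
# Hypothetical multi-instance dbstore: one mysqld per section, one port each.
# Assumed mapping for illustration: s1 (enwiki) -> 3311, s8 (wikidatawiki) -> 3318.
host="dbstore1003.eqiad.wmnet"   # placeholder host name

# On multisource dbstore1002 both wikis live in the same mysqld, so one query can
# join across them. On a multi-instance host they are separate server processes,
# so each section has to be queried on its own port:
mysql -h "$host" -P 3311 enwiki       -e "SELECT COUNT(*) FROM page;"
mysql -h "$host" -P 3318 wikidatawiki -e "SELECT COUNT(*) FROM page;"

# A join such as enwiki.page JOIN wikidatawiki.wb_items_per_site cannot be sent
# to two different processes; it would have to be redone client-side (or in hadoop).
```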
[16:22:24] I was just giving my point of view for this discussion, that's all [16:22:30] and lifespan of dbstore replacement or redundancy (and which of the 2) [16:22:51] he will put that in a ticket [16:23:01] and then we can kill each other :-D [16:23:21] I will just hide between 3 NEW SERVERS and you'll not find me :) [16:24:09] if they are as busy as the m4 ones, I will find you easily [16:24:14] hahaha [16:27:27] so regarding the general idea- yes, different services on different servers [16:27:53] but I do not like idle "backup services" [16:28:55] and spending a huge amount of money for a server to be idle- that is why we could combine stuff that is passive, or only needs data redundancy but not service redundancy [16:29:05] that is my general idea [16:29:44] our blocker here is that we do not know yet how active some things are going to be, based on their idea for replacement [16:29:53] In general me neither, but in this case 1 extra server saves us from touching and changing m4 to multiinstance (not sure if that needs lots of changing from the analytics side) for a service that will be gone in a timeframe of 2 years as elukey said [16:30:10] Going to call it a day, and let luca do his homework :) [16:30:18] well, if we were to add m4 to the pool [16:30:26] I would just set it on a separate port [16:30:39] no changing, just adding, no extra work [16:30:43] big IF [16:30:54] we'd need to puppetize that [16:30:59] Anyways, not going into that now [16:31:00] so both eventlogging and dbstore hosts should have a maximum lifespan, if everything goes as planned, of say two years [16:31:47] as long as the m4 master is not on the same host as other ones, I am fine with co-locating [16:31:56] so the question here is whether to buy a new server for just 1 year [16:32:10] not the replacement, that is not a discussion [16:32:21] it would be reused as soon as the service is killed [16:32:29] yep this is the ide [16:32:31] *idea [16:32:33] but dbstore1002 -> 2 redundancy [16:32:55] ? [16:32:57] "reused" seems like you want to steal it :-) [16:33:08] once it is no longer in use [16:33:13] Well, yeah, my point is that it is not a waste of money [16:33:37] elukey: put that last thought on a ticket [16:33:48] I will fight manuel tomorrow on the ticket [16:33:49] sure, in a meeting now but I'll do it asap [16:33:54] :D [16:33:56] thanks! [16:33:59] I am going to logoff [16:34:02] see you tomorrow guys! [16:34:21] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3967992 (10Marostegui) [16:34:30] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 2 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#3967993 (10Marostegui) [16:34:43] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#3967994 (10Marostegui) [19:55:39] 10DBA, 10Phabricator (2018-02-xx), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3968928 (10mmodell) >>! In T187143#3966122, @Marostegui wrote: > Midnight UTC would be impossible for me > Thursday UTC morning I could do it (I c...
[21:38:21] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3969450 (10Ladsgroup) Well, enabling it on all wikis is one thing that we need to do and with {T185693} I think it's fine to move forward... [21:41:53] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3969463 (10Cmjohnson) [21:55:59] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3969490 (10Cmjohnson) [22:26:36] 10DBA, 10Community-Tech, 10MediaWiki-extensions-GlobalPreferences, 10Patch-For-Review, 10Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#3891821 (10Niharika) Hey @jcrespo, we're done with the security review for the extension and are hoping to go to beta...