[05:50:28] 10DBA, 10Jade, 10Operations, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Krinkle) [05:52:38] 10DBA, 10Jade, 10Operations, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Krinkle) ## 16 Jan 2016 - Draft phabricator reply regarding Jade I'm writing to summarise the meeting with the Scoring team ab... [06:00:26] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10Marostegui) Yes, it was, as shown at: T212154#4862734 ` root@db1065.eqiad.wmnet[mysql]> select u... [06:03:02] So jynus you want to disable gtid, do the topology changes and enable gtid? [06:15:09] jynus: I am ready to do topology changes [06:18:15] so I wanted to enable gtid on the unrelated replicas only [06:18:40] to not leave those not-crash safe for long [06:18:43] cool, so leaving aside db1078 only, right? [06:18:51] yes, and db1075 [06:18:56] yeah [06:19:02] but it is not a big deal [06:19:09] sounds like a good idea, I will do that once the topology is changed [06:19:15] I am happy with only enabling it on one host [06:19:15] Any objections to go ahead and start changing things? [06:19:32] no, just that I needed my cofee [06:19:36] haha [06:19:38] go and get it [06:19:39] and I got it [06:19:41] I will start doing them [06:20:07] on the next iteration I will handle replica movement with zarcillo automatically [06:20:20] sounds good, we have to do it at some point :) [06:20:32] I would like to trigger the switchover.py script this time [06:20:35] if you are ok with it [06:21:00] you mean yourself? [06:21:05] yeah [06:21:19] ok, but paste the output when done [06:21:23] sure thing [06:21:38] it is ok if you do it yourself too! just say it :) [06:28:36] do we switch some variables? [06:28:52] what do you mean? [06:29:05] line 10 to 18 [06:29:25] I think line 11 we can do once it is slave [06:29:34] Same for 12-17, no? [06:29:45] 17 I was looking at [06:29:55] sure, we can [06:33:52] jynus: going to merge: https://office.wikimedia.org/wiki/User:CRusnov_(WMF) [06:33:57] not that :) [06:34:03] but: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484612/ [06:34:43] puppet is stopped [06:34:47] on both [06:34:49] so we are good to merge [06:36:05] ok, go on [06:38:14] make sure you run it with --skip-slave-move or things will just go badly [06:38:36] yeah, this is what I would run: switchover.py --skip-slave-move db1075 db1078 [06:38:43] I am copying it from the etherpad directly [06:38:51] do you prefer to do it yourself? [06:38:58] (i am fine with that!) [06:41:29] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10Marostegui) Where is this database? I cannot find it on the misc masters. ` root@cumin1001:~# for i in db1063 db1065 db1072 db1073; do mysql.py -h$i -e "show databases like '%fr%'";done root@cum... 
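The plan above is to disable GTID on the unrelated replicas before the topology change and re-enable it afterwards, but no commands for that step are quoted. Below is a minimal sketch of what that toggle looks like on a single MariaDB replica; the host name is an assumed example and the actual switchover was driven by the internal switchover.py tooling, so this illustrates the idea rather than the exact procedure that was run.

    # Assumed example replica; the real targets were the s3 replicas hanging off db1075.
    REPLICA="db1123.eqiad.wmnet"

    # Before the topology change: fall back to binlog file/position replication.
    mysql -h "$REPLICA" -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = no; START SLAVE;"

    # ... the master switch happens here (switchover.py --skip-slave-move db1075 db1078) ...

    # Afterwards: re-enable GTID so the replica is not left non-crash-safe for long.
    mysql -h "$REPLICA" -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = slave_pos; START SLAVE;"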
[06:45:20] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10jcrespo) It should be on m2: https://wikitech.wikimedia.org/wiki/MariaDB/misc#Current_schemas_2 [06:47:33] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10Marostegui) So looks like it is only on codfw hosts, which indicates it is not really in use: ` root@cumin1001:/home/marostegui# for i in db1065 db1117:3322 db2044 db2078:3322; do echo $i; mysql... [06:51:58] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10jcrespo) our backup systems has this bug where empty dbs are not recovered, maybe it is an empty and was deleted by accident (with no data loss?). [06:52:36] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10Marostegui) it is empty actually ` root@db2044:/srv/sqldata/frimpressions# ls -lh total 4.0K -rw-rw---- 1 mysql mysql 54 Jun 4 2014 db.opt ` [06:53:57] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10jcrespo) awight - do you need its contents? Maybe it was archived in the past, we would have to do some research about that. [07:15:23] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) [07:15:50] so do you trust my script more now? [07:16:00] :-) [07:16:18] totally! <3 [07:16:19] needs more work still [07:16:58] I always trusted it, just wanted to use it myself :) [07:30:16] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10Marostegui) I think this database was just created but never actually populated: T83011#908236 [07:32:20] 10DBA, 10Analytics, 10Operations, 10ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) [07:32:25] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) 05Open→03Resolved a:03Marostegui This was done: Read only ON at: 07:01:00 Read only OFF at: 07:04:20 Total time read only time: 03:20 minutes If you see something... [07:33:50] o/ [07:33:50] 10DBA, 10Analytics, 10Operations, 10ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) [07:33:56] yes, a package [07:34:10] ok, you want to do that (and upgrade it) and I take care of the a3 maintenance? [07:34:15] review what's pending from our side [07:34:19] and depool whatever we need [07:34:26] I don't care, just I don't want you to do everthing [07:34:37] ok, you do db1075 and I do the other rack [07:34:38] :-) [07:34:41] :) [07:35:26] ok then prepare a patch and I will review it [07:35:42] consider increasing weight of db1123 [07:35:59] yeah [07:36:01] if we are going to put 75 depooled for a long time [07:36:09] 10DBA, 10Analytics, 10Operations, 10ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Peachey88) [07:36:54] if we need to do a package+upgrade db1075, yeah, I will increase db1123 now [07:37:22] well [07:37:26] mostly for later [07:37:40] but that too, just that that only requires a few minutes [07:38:00] that is why I was asking [07:38:08] do you want to depool the servers now [07:38:42] the other rack's one? 
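The frimpressions checks quoted above are cut off by the bot; the sketch below shows the same kind of sweep run from a cumin host with the internal mysql.py wrapper. The host list and port suffixes are the ones from the task comment; the information_schema query is an assumed way of confirming the schema is empty, not necessarily the statement that was actually run.

    # Hosts/ports as quoted in T213973; mysql.py is the internal client wrapper on the cumin hosts.
    for host in db1065 db1117:3322 db2044 db2078:3322; do
        echo "== $host =="
        mysql.py -h "$host" -e "SHOW DATABASES LIKE 'frimpressions'"
        mysql.py -h "$host" -e "SELECT COUNT(*) AS tables_in_schema
                                FROM information_schema.tables
                                WHERE table_schema = 'frimpressions'"
    done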
yeah I am going to review those [07:38:50] https://gerrit.wikimedia.org/r/484864 [07:39:20] ok add me as a reviewer when you have something I can do [07:39:25] I will work on db1075 [07:39:29] cool [07:39:36] ok with that db1123 increase? [07:39:44] there are some backups alerts too, btw [07:39:51] I saw [07:39:54] which was kinda expected as you mentioned yesterday [07:40:03] although eqiad ones finished correctly [07:40:06] so not too worried [07:40:10] ok [07:40:18] we should setup some high availability [07:40:21] for metadata [07:40:43] yeah, we could replicate only zarcillo to codfw maybe [07:40:49] we can discuss that later [07:40:58] I am going to merge db1123 increase? you ok with that? [07:41:03] ah [07:41:05] you just +1 [07:41:06] thanks! [07:43:38] 10DBA, 10Operations, 10Patch-For-Review: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @anomie you can restart s3 migration script [07:45:02] 10DBA, 10Analytics, 10Operations, 10ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) db1075, s3 primary master, was failed over to db1078 which is in row C. [07:46:27] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10Marostegui) Just to confirm, the slave (which I think you don't use) also has the change applied... [07:49:53] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484872/ [07:52:00] is that really all? [07:52:09] seems less impact that I thought [07:52:15] a2 has a lot more hosts [07:52:18] what about es hosts? [07:52:20] but today it is only a3 apparently [07:52:23] oh [07:52:30] I didn't knew that [07:52:42] I will read the task [07:52:49] do you want to merge soon? [07:53:13] I guess if it is only that it can wait? [07:53:35] paravoid: today it is only a3, right? [07:54:00] the original task was s2 [07:54:06] but maybe plans changed [07:54:11] that is why I don't know [07:54:19] yeah, but on the hangouts we had, I think it was said that a3 was in a worse state than a2 [07:55:22] a2 (T213748) also said Thu 17, but backup date Tuesday, which I think it was the thing we agreed on [07:55:23] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [07:56:04] 10DBA, 10Analytics, 10Operations, 10ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) @RobH is this happening today too along with a3 maintenance or is this finally moved to Tue 22nd? [08:28:41] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) [08:29:31] jynus, marostegui: fyi, I'm deploying the systemd updates to the stretch-based DB roles in ~10-15 mins. it's working fine on almost 800 servers already, so I don't expect any issues, but there might be the odd puppet/dpkg alert spam (which I'll look after) [08:29:43] thank you [08:30:33] moritzm: I think I already deployed those to some already [08:30:42] Morning all! :) [08:31:30] jynus: ack, the update itself is unproblematic and I've tested it with various combinations, it's mostly about alert spam et al. and to keep you in the loop [08:31:43] thanks, moritzm [08:33:43] marostegui: apparently there is was Replicate_Wild_Ignore_Table options on db1075 [08:33:57] but only for the stuff that was moved to s5 [08:34:12] yep. 
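A few lines up, replicating only the zarcillo metadata schema to codfw is floated as a lightweight high-availability option. The sketch below shows what a single-schema filter could look like on a hypothetical codfw replica; the host is a placeholder and nothing like this was actually set up in this log, and, as the next exchange shows for the ignore variant, these filter settings stick around until explicitly cleared, so they would need to be managed deliberately.

    # Placeholder host; no such zarcillo-only replica exists in this log.
    REPLICA="db2xxx.codfw.wmnet"

    # Replication filters can only be changed while the slave threads are stopped.
    mysql -h "$REPLICA" -e "STOP SLAVE;
                            SET GLOBAL replicate_wild_do_table = 'zarcillo.%';
                            START SLAVE;"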
hopefult that caused no issues [08:34:36] does that get added despite reset slave all? [08:34:48] should we check s5 master? [08:35:01] interesting, true, we did a reset slave all [08:35:14] there is nothing on s5 master [08:35:16] or maybe it was on configuration? [08:35:23] but puppet ran [08:35:34] i mean, it has been running since we did that change [08:35:43] let's check the logs [08:35:52] this is what I would do [08:36:05] do a pass on al servers with replication enabled [08:36:23] and run change master on the ones that aren't to see if it appears [08:36:27] or check the variable [08:36:55] yeah, it shows on show global variables like 'Replicate_Wild_Ignore_Table'; [08:37:08] checking the variable should be enough for the ones without replication [08:37:32] but the fact that it survives reset slave all is scary [08:37:38] Jan 17 07:01:32 db1075 mysqld[2475]: 2019-01-17 7:01:32 139334214588160 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='db1078.eqiad.wmnet', master_port='3306', master_log_file='db1078-bin.000165', master_log_pos='412441960'. [08:38:01] yeah it takes the variable from global status [08:38:20] on restart, the variable will get reseted [08:38:26] but it will presist on slave.status [08:39:36] not ideal because it gets sticky with no easy way to unsticky it [08:39:48] yeah, not nice [08:40:34] I guess reset slave all; eplicate_Wild_Ignore_Table='' as standard [08:40:39] yeah [08:40:45] otherwise it can cause issues [08:40:47] serious issues [08:40:50] because reset slave and reset slave all are not enough [08:41:00] maybe we should propose reset slave all all; [08:41:34] haha I was searching on bugs.mysql to see if there is a related bug on this [08:41:43] mariadb [08:41:52] although we should check mysql behaviour [08:42:01] which implemented more recently the same variable [08:42:08] dynamically [08:45:02] I checked 8.0 doc and it doesn't mention anything about being persistent [08:45:08] they implemented CHANGE REPLICATION filter [08:45:18] so maybe it gets wiped after a reset slave all after all [08:45:24] I see, not as a variable there? [08:45:32] which is what it would make sense [08:45:48] mariadb behaviour is consistent but missleading [08:45:49] https://dev.mysql.com/doc/refman/8.0/en/replication-options-slave.html#option_mysqld_replicate-wild-ignore-table [08:47:02] and mariadb also doesn't mention it will persist https://mariadb.com/kb/en/library/replication-filters/ [08:47:11] so maybe you found a bug :) [08:47:17] well, it is a variable, it us supposed to persist [08:47:38] yeah, but I wouldn't expect that at all after a reset slave all [08:47:52] I will just rant on twitter about it [08:48:28] hahaha [08:48:53] https://twitter.com/hashtag/mysql8implementationismorerobust [08:56:43] hahaha [08:57:07] so s4 and s6 backups are indeed fine, should we just ack the alert or should I insert a row on zarcillo? [08:57:42] I only checked eqiad [08:57:48] thouse can be down [08:57:55] yes, I checked eqiad only first but they are ok [08:57:58] meaning they were done [08:58:15] I will check s2 codfw after those [08:58:45] i can insert a line on zarcillo or ack the alert [09:00:33] just down for 8 days [09:00:54] down as downtimed? 
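The conclusion of the filter discussion above is that RESET SLAVE ALL does not clear replicate_wild_ignore_table in MariaDB, so a stale filter resurfaces on the next CHANGE MASTER TO. A short sketch of checking for and explicitly clearing a leftover filter, along the lines proposed above, follows; the host is the one under discussion, but the exact cleanup command that was run is not shown in the log.

    HOST="db1075.eqiad.wmnet"

    # Is a wild-ignore filter still configured on the host?
    mysql -h "$HOST" -e "SHOW GLOBAL VARIABLES LIKE 'replicate_wild_ignore_table'"

    # Clearing it has to be explicit, since RESET SLAVE ALL leaves it in place;
    # filter variables can only be changed while the slave threads are stopped.
    mysql -h "$HOST" -e "STOP SLAVE;
                         SET GLOBAL replicate_wild_ignore_table = '';
                         START SLAVE;"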
[09:01:10] downtime until thursday 00:00 [09:01:14] gotcha [09:01:47] done [09:01:49] will check codfw now [09:01:50] maybe we shoud revisit metadata gathering if the model is unreliable [09:03:15] s2 codfw is also good [09:12:15] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Marostegui) [09:52:33] marostegui: so I'd say to depool and stop all a2 hosts too as we didn't get an answer [09:53:23] let's wait a bit more to see if we get an answer from paravoid [09:53:28] ok [09:53:29] I will prepare the patch so you can review it [09:54:09] I am not too worried about the service but about the data [09:54:14] yeah yeah [09:54:19] but there are lots of hosts, so the patch will be long [09:54:25] so at least it can be reviewed already [09:54:52] elukey: ^ that also affects db1107 [09:58:03] marostegui: sorry I didn't get what [09:58:29] a2 hosts? [09:58:37] elukey: if we are doing a2 rack maintenance (which we still don't know) db1107 is eventlogging [09:58:45] ah yes I know [09:58:53] I added info to the task [09:58:59] I don't think we are doing a2 today, but we didn't get confirmation about it (either yes or no) [09:59:08] I'll stop eventlogging and replication before maintenance (if we do it) [10:15:36] I am reviewing [10:17:24] no rush :) [10:17:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Marostegui) [11:02:26] elukey: I will leave db1107 entirely for you (I downtimed it though) [11:03:23] sure, I'll also downtime db1108 so stopping replication will not cause alerts [11:10:13] marostegui: all done except that the eventlogging sanitization script is running on db1107 now.. I am sure it should finish in a bit, so I'd prefer not to kill it now [11:10:19] it is a daily systemd timer [11:10:24] cool - thanks [11:10:25] we are in a meetnig [11:10:29] okok [11:43:43] 10DBA, 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10Marostegui) All the systems owned by the DBAs are now off. [11:45:43] so what time is the thing, in 15 minutes? [11:45:44] jynus: I have stopped and powered off db1075 [11:45:48] thanks [11:45:52] I forgot [11:45:53] All hosts are now off [11:45:58] we are good for a2 and a3 [11:46:34] errors on db1068 keep decreasing \o/ [11:46:44] saw it [12:06:28] marostegui: db1107 is ok to shutdown [12:06:38] you do that? [12:06:43] I can yes [12:06:49] excellent! [12:06:51] thank you [12:07:02] mysql stop + regular shutdown right? nothing special [12:07:10] we normally umount /srv [12:07:15] after the mysql stop [12:07:44] ignorant qs - isn't it done by the os while shutting down? [12:08:02] yeah, it should, but it doesn't hurt to do it manually just in case [12:08:09] sure sure [12:08:48] elukey: there is 2 reasons [12:08:51] one, it could timeout [12:09:07] and more important, it helps us identify if there is something running on /srv [12:09:15] ack [12:09:19] sometimes manuel leaves open screens [12:09:38] that way I make sure I don't kill them without knowing [12:09:45] it won't be needed for you, really [12:10:10] but e.g. dbstore1002 takes 1 hour to shutdown, so there are issues, etc. 
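The shutdown routine described above (stop MySQL, unmount /srv to surface anything still using the data directory, then power off) is sketched below. The service name is an assumption that varies per host, and dbstore1002 in particular was still running under mysqld_safe at this point, so treat this as the general shape rather than a host-specific recipe.

    # Stop the database first (unit name varies; on hosts running mysqld_safe,
    # "mysqladmin shutdown" is the equivalent).
    systemctl stop mariadb

    # Unmounting /srv before poweroff surfaces anything still holding the datadir,
    # e.g. a forgotten screen session; fuser lists the culprits if umount refuses.
    umount /srv || fuser -vm /srv

    poweroff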
[12:10:57] makses sense [12:11:12] also, being a master, it won't have any issue if it crashes [12:14:17] db1074 is complaining [12:14:27] from mwmaint [12:14:37] yes, cron? [12:14:48] yeah, I guess it was a cron not re-reading the config [12:14:54] is it the vslow host? [12:15:01] no [12:15:02] api [12:15:05] but it is depooled [12:15:09] interesting [12:15:11] it has been depooled for an hour or so [12:16:09] what could it be? [12:16:24] I know [12:16:28] migrateActors [12:16:39] let's kill it on s2 [12:16:51] php -ddisplay_errors=On /srv/mediawiki/multiversion/MWScript.php migrateActors.php --wiki=ptwiki --batch-size 2000 [12:16:54] yeah [12:17:09] 161920 [12:17:15] I will update anomies ticket while you wkill it? [12:17:22] ok [12:17:45] Or I can kill it [12:17:48] I killed s3 yesterday [12:17:51] So I can do s2 [12:17:52] I killed it [12:17:55] ah thanks [12:19:49] https://phabricator.wikimedia.org/T188327#4887781 [12:20:40] I am glad we killed the one for s3 too, otherwise…I guess it would have tried to keep using db1075 [12:25:47] yeah [14:14:30] 10DBA, 10Serbian-Sites: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 (10Zoranzoki21) 05Stalled→03Open Noone was not opposed, reopening. [14:20:33] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10Marostegui) 05Open→03Resolved a:03Marostegui I have dropped this empty database and updated documentation. [14:23:52] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10Marostegui) >>! In T212154#4886394, @Pchelolo wrote: > Hm, trying to deploy the service again I... [14:30:17] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10Marostegui) 05Open→03Resolved It now matches https://gerrit.wikimedia.org/r/#/c/operations/p... [17:02:39] jynus: all databases on a2 are back, I am waiting for confirmation that the maintenance is finished, to start starting mysqls [17:02:57] Yeah I saw it [17:03:04] But I think we should wait till tomorrow morning to repool them [17:03:39] Tomorrow morning I can issue a revert to the gerrit patch and repool all of them [17:04:13] db1075 was not part of the patch [17:04:18] yeah [17:04:22] And db1103 [17:04:26] So I can do those after that [17:06:56] do I kill dbstore1002? [17:07:53] no [17:07:56] it is doing recovery [17:08:18] (or should be) [17:08:39] I am not sure about that [17:08:45] both innodb and toku finished that [17:08:52] mmm looks like it finished even loading the pool [17:08:52] even loading the buffer pool did [17:09:11] yeah, last time I tried to stop it it didn't work, I had to kill mysqld_safe [17:09:15] I think I will do it again :( [17:09:32] try sigint first [17:09:39] yeah, that didn't work first time [17:09:45] it takes a lot of time [17:09:57] but it should at least try to stop [17:10:09] even if it gets freezed [17:10:20] I just did, we will see [17:10:36] Meanwhile, I am going to start starting mysqls [17:10:39] it didn't answer [17:10:39] ok? [17:10:45] can I try? [17:10:49] go ahead [17:11:33] there is an overload witch check_mariadb.pl [17:11:39] yep, saw that [17:11:47] cause it gets stuck [17:11:56] there yo go [17:11:58] it is now up [17:12:03] did you do anything? 
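Earlier in this excerpt, the complaints from the depooled db1074 are traced to the migrateActors.php run on the maintenance host still using the pre-depool configuration. The two checks involved could look like the sketch below; the host names and PID are the ones quoted in the log, while the queries themselves are an assumption about how the inspection was done.

    # On the depooled replica: is anything still running queries despite the depool?
    mysql -h db1074.eqiad.wmnet -e "SELECT user, host, db, time, LEFT(info, 80) AS query_head
                                    FROM information_schema.processlist
                                    WHERE command <> 'Sleep'"

    # On the maintenance host: locate the long-running migration script and stop it.
    pgrep -af 'migrateActors.php' | grep -- '--wiki=ptwiki'
    kill 161920    # PID quoted in the log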
[17:12:06] it is now up [17:12:40] now it went down [17:12:43] right after it went up [17:12:52] 190117 17:11:52 [Note] /opt/wmf-mariadb10/bin/mysqld: ready for connections. [17:12:55] Version: '10.0.22-MariaDB' socket: '/run/mysqld/mysqld.sock' port: 3306 MariaDB Server [17:12:58] 190117 17:12:32 [Note] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown [17:13:01] was that you? [17:13:05] I killed perl [17:13:12] and then I sigint mysqld [17:13:21] whether it worked or not I cannot tel [17:13:24] so it came up? [17:13:26] weird [17:13:42] maybe it was a coincidence [17:13:45] maybe [17:14:25] I tend to have good luck on these things [17:14:29] :-) [17:14:29] XDDDDD [17:14:57] I will just play it cool and tell you of course it was me [17:15:01] hahaha [17:15:04] and I will believe it! [17:15:08] you were just missing the secret sauce [17:15:32] let me restart it, ok? [17:15:40] go for it [17:15:44] I am starting mysql on the other hosts [17:16:46] doesn't start properly [17:17:21] 10DBA, 10Fundraising-Backlog: Remove frimpressions db from prod mysql - https://phabricator.wikimedia.org/T213973 (10cwdent) @Marostegui @jcrespo thanks! [17:18:09] it is starting now, no? [17:18:16] it is loading the buffer pool I guess? [17:19:09] all perl processes again [17:22:10] so it took around 30 minutes last time to fully boot up [17:22:15] let's see this time... [17:33:26] dbstore1002 back up [17:34:36] replication working fine [17:36:00] jynus: as everything is fine I am going off. All the DBs in a2 are still catching up, I will repool them tomorrow morning [17:36:04] marostegui: shall I restart the ariab->innodb conversion ? [17:36:07] Thanks for all the work today! [17:36:37] elukey: let's let it catch up a bit tonight, as it is quite delayed because of the migration script from anomie [17:36:45] sure [17:39:30] marostegui: one last question, do I review downtimes? [17:39:41] Let me see [17:39:43] just in case? [17:39:52] I can do that, I am not asking you :-) [17:40:06] let me downtime it for 20 hours [17:40:21] cause the downtime expires at 19 UTC [17:40:21] I can do that, don't worry [17:40:33] i have the one liner :) [17:40:37] ok [17:40:38] right in front of me [17:40:40] how many hours [17:40:44] maybe 14 hours? [17:40:49] share it! [17:40:53] haha [17:41:07] for i in `cat a2`; do icinga-downtime -h $i -d 14400 -r "T213748" ; sleep 5; done [17:41:08] T213748: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 [17:41:10] XDDD [17:41:20] a2 is at /home/marostegui [17:41:48] 50400 -> that is 14 hours [17:42:00] I gave robh an update of upcoming purchases [17:42:06] ah thank you [17:42:23] we need to do some decision on provisioning [17:42:31] and sync with alex on bacula [17:42:35] downtimed 14h [17:42:42] I am going to go off [17:42:52] thank you marostegui [17:42:56] thank you! 
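The downtime one-liner pasted above uses 14400 seconds, which is 4 hours; the figure matching the stated 14 hours is the 50400 mentioned right after it. The same loop with the arithmetic spelled out is below; icinga-downtime and the a2 host list file are the ones referenced in the log.

    DURATION=$((14 * 3600))    # 14 hours = 50400 seconds (14400 would only be 4 hours)

    while read -r host; do
        icinga-downtime -h "$host" -d "$DURATION" -r "T213748"
        sleep 5                   # pace the submissions, as in the original one-liner
    done < /home/marostegui/a2    # file listing the a2 rack database hosts, per the log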
[17:43:04] I willl give a review of what is going on [17:43:10] and will go soon too [17:43:15] you better do [17:43:59] bye [18:11:17] 10DBA: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (10RobH) [18:11:24] 10DBA: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (10RobH) [19:50:11] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [19:50:44] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Ottomata thanks! I've updated the task description and ping the groups you mentioned. [19:51:10] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [19:52:15] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Banyek and @Dzahn I'd appreciate your input on this task. Thank you. [20:01:03] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > We should have a clear separation of concerns and while the hadoop cluster is in charge of computing the data the t... [20:06:42] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Dzahn) How to install the importer scripts is what i started once in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476... [20:21:43] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) >It has been abandoned after Analytics said to not use stat hosts and use Hadoop instead. To clarify: stats machines sho... [20:24:27] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) > to have a daemon on the mysql hosts To clarify, it is unlikely these scripts would run on the mysql servers themsel... [20:31:21] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) I think one telling use case the ilustrates why we want to decouple data loading from hadoop is a rollback. Say that yo... [20:37:18] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) Rollback is already taken care of the in the script level. We'll have different versions of the data in MySQL and ca... 
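The last comment above describes keeping several versions of the imported data in MySQL so a bad dataset can be rolled back at the script level. One common shape for that, sketched below purely as an illustration (the host, schema and table names are invented, not the actual recommendation-api layout), is to load each export into its own versioned table and then atomically repoint a stable name at it.

    VERSION=$(date +%Y%m%d)
    MASTER="m2-master.eqiad.wmnet"    # placeholder; where this data should live was still under discussion

    # 1. Load the new export into its own versioned table.
    mysql -h "$MASTER" -D recommendations -e "CREATE TABLE article_recs_${VERSION} LIKE article_recs_current"
    # ... bulk-load the Hadoop export into article_recs_${VERSION} here ...

    # 2. Atomically swap the stable name over; the previous version stays around for rollback.
    #    (A real loader would also handle an already-existing *_old table.)
    mysql -h "$MASTER" -D recommendations -e "
        RENAME TABLE article_recs_current TO article_recs_${VERSION}_old,
                     article_recs_${VERSION} TO article_recs_current"

Rolling back is then just the reverse RENAME, which is what makes the "different versions of the data in MySQL" approach cheap on the database side.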
[20:46:02] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) @bmansurov how do handle deleting data in your storage when you have reached capacity or when that dataset is bad? There... [20:52:01] 10DBA, 10Analytics, 10Operations, 10Research, 10Article-Recommendation: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Nuria > how do handle deleting data in your storage when you have reached capacity or when that dataset is bad? T... [20:56:07] 10DBA, 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) 05Open→03Resolved a:03RobH Synced up with Chris via IRC: All systems were able to come back up within a2 without incident. The spare PDU is...
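On the retention question raised above, a versioned-table layout implies some pruning step so old copies do not pile up; the actual answer in the task is cut off in this log, so the sketch below is only one plausible shape, reusing the invented names from the previous sketch.

    MASTER="m2-master.eqiad.wmnet"    # same placeholder host as in the previous sketch
    KEEP=3                            # arbitrary retention depth, purely for illustration

    # List dated versions newest-first and drop everything beyond the retention depth.
    mysql -h "$MASTER" -N -e \
        "SELECT table_name FROM information_schema.tables
         WHERE table_schema = 'recommendations'
           AND table_name REGEXP '^article_recs_[0-9]{8}'
         ORDER BY table_name DESC" |
    tail -n +$((KEEP + 1)) |
    while read -r old_table; do
        mysql -h "$MASTER" -D recommendations -e "DROP TABLE ${old_table}"
    done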