[00:01:31] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#2546917 (10bd808) DNS switch announced: https://lists.wikimedia.org/pipermail/cloud-announce/2017-...
[00:05:29] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3836277 (10bd808) @Marostegui and @jcrespo: Could one of you please stop replication on labsdb1003...
[01:31:48] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#2546917 (10dschwen) Death blow for GHEL coordinate extraction and WikiMiniAtlas. 🙁
[05:32:16] 10DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3836579 (10jcrespo) p:05Normal>03High This should be a top priority because of T181731#3836576, in particular db1096, db1097 and db1101 (dewiki.archive).
[05:39:30] 10DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3836584 (10jcrespo) Also dbstore1002 (maybe dbstore1001, too).
[06:32:58] 10DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3836591 (10Marostegui) a:03Marostegui
[06:51:27] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3836604 (10Marostegui) Looks like only the recentchanges hosts broke as per the logs: ``` 04:23 < icinga-wm> PROBLEM - MariaDB Slave SQL: s5 on db1097 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, E...
[07:24:01] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3836625 (10Marostegui) >>! In T142807#3836277, @bd808 wrote: > @Marostegui and @jcrespo: Could one...
[07:25:24] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836626 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[07:37:12] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3836633 (10Marostegui)
[07:58:52] https://gerrit.wikimedia.org/r/#/c/398222/
[08:11:26] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ```
[08:12:18] volans: the reimage failed this time with:
[08:12:25] 08:11:20 | db1111.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: Warning: Setting configtimeout is deprecated.
[08:12:28] (at /usr/lib/ruby/vendor_ruby/puppet/settings.rb:1146:in `issue_deprecation_warning')
[08:12:31] /usr/local/share/bash/puppet-common.sh: line 68: /var
[08:12:33] XDDDDD
[08:21:18] :D
[08:24:55] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3836676 (10Marostegui)
[08:30:16] I've seen that on installed hosts during puppet runs as well
[08:30:47] not sure if that is related to the puppet migration? _joe_?
[08:44:05] yeah, that's related to the puppet migration, I think I saw it mentioned on IRC, let me check
[08:48:30] marostegui jynus We recently deployed a patch that changes the way the ores extension interacts with the database (it's live on wikidatawiki now but will go live everywhere tonight), is there anything unusual happening due to it?
[08:49:02] marostegui: yeah, related to the puppet migration: https://phabricator.wikimedia.org/T182585
[08:49:16] Amir1: we had an s5 outage tonight
[08:49:27] but I do not think it is related to that
[08:50:47] oh okay, please keep me posted, especially since it should reduce the load on the database; if that's not happening, or it's even making it worse, I need to rework it
[08:50:58] thanks moritzm - just commented on that ticket :)
[08:51:25] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3836737 (10Marostegui) p:05Triage>03Normal
[08:52:40] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3836729 (10Marostegui) a:03Cmjohnson @Cmjohnson please get this disk replaced whenever you can. Thanks!
[09:03:51] marostegui: checking logs
[09:04:44] don't know if you might want to add an exception for that, if it is related to the migration
[09:06:01] ahhh damn it, it is the deprecation warning :(
[09:08:18] sending a fix
[09:09:19] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3836763 (10jcrespo) @bd808 could we change the old servers to point to the analytics hosts instead...
[09:09:23] \o/
[09:10:32] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3836764 (10Marostegui)
[09:17:59] marostegui: I'm digging a bit more because I cannot repro the same output on the current system; it also printed
[09:18:09] /usr/local/share/bash/puppet-common.sh: line 68: /var
[09:19:50] sorry, wrong paste
[09:19:53] /var/lib/puppet/state/last_run_summary.yaml: No such file or directory
[09:21:07] :|
[09:22:30] and ofc the file is there now and the commands work :D
[09:23:07] I can re-issue a reimage if that would help
[09:23:39] I'm assuming the file was not there at the time, even though it is now
[09:23:47] and will patch accordingly ;)
[09:30:18] marostegui: what are you doing with the depools? checksum? provisioning? other fixes?
[09:30:32] the s5 ones?
[09:30:38] the ones today
[09:30:45] s5 -> checksum
[09:30:47] db1091 -> alter
[09:30:53] s5-s8
[09:30:59] 91 is which replica set?
[09:31:04] s4
[09:31:07] ok
[09:31:11] I just want to know
[09:31:15] sure :)
[09:31:27] to make sure I do not step on your toes
[09:32:07] I wonder which of the fires I should attend to first
[09:32:38] which ones do we have on the table right now?
[09:32:40] if you are with s5, maybe I should try to fix misc? any other suggestion?
[09:32:47] I will take care of s5, yes
[09:32:55] Fixing misc is a good idea indeed
[09:32:58] or maybe I can help with s5?
[09:33:10] whatever you prefer, too many cooks are not good either
[09:33:20] s5 is still dumping
[09:33:32] and I assume we will not be able to do more than 2 at a time?
[09:33:38] yeah
[09:33:56] My guess is that we only have differences on the rc slaves
[09:34:01] as none of the others broke
[09:34:07] only rc slaves, if I am not mistaken
[09:34:09] actually, 3 broke
[09:34:14] plus dbstores
[09:34:18] yeah, dbstores...
[09:34:47] it is difficult to say because we didn't failover to a copy of the old master
[09:34:58] yep
[09:36:00] I will try to make a plan of the pending misc, which is blocking decom, which is blocking long-term planning
[09:36:06] haha
[09:36:08] good!
[09:36:13] misc themselves are not that important
[09:36:32] but they get in the way of other things, like "how many servers do we need?"
[09:48:44] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3836799 (10Marostegui)
[10:24:58] marostegui: patches sent, if you want the TL;DR I can do it here ;)
[10:25:10] let me seeee
[10:27:40] Ah, I see
[10:27:58] Makes sense to me. Once the migration is completed, that will not be needed, I assume?
[10:28:18] the dev/null hopefully not, it should be temporary
[10:28:41] the other CR, and the change from grep+awk to simple awk, are instead to fail properly if the file is not there
[10:29:07] a thing that I cannot yet explain, unless you were so unlucky as to hit the race condition where puppet was overwriting the file
[10:29:24] I don't believe it much because the file is quite small
[10:29:39] so maybe there is a different behaviour of the new puppet client that I'm not aware of
[10:30:32] We can try again once you merge and see what I get :)
[10:31:18] fair enough! merging both of them
[10:31:50] thanks for being a beta-tester
[10:32:08] hahaha
[10:38:18] marostegui: you can kick off the reimage whenever you want (hoping it works this time)
[10:38:28] ok
[10:38:30] let me do it
[10:38:32] if possible
[10:38:41] use -d
[10:38:51] sure
[10:38:53] so just in case, I'll have a bit more detail
[10:38:54] thanks ;)
[10:39:06] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3836885 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
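For context on the fix discussed above at 10:28-10:29: parsing the puppet summary through a grep+awk pipeline hides a missing file, because a pipeline's exit status is that of its last command (awk, which happily succeeds on empty input), while a single awk invocation fails loudly when it cannot open the file. A minimal sketch of both ideas follows; the actual Gerrit patches are not quoted in this log, and the YAML key and redirected command here are assumptions:

```
#!/bin/bash
# Hedged sketch of the two fixes discussed above, not the deployed patches.
SUMMARY=/var/lib/puppet/state/last_run_summary.yaml

# Idea 1 (the temporary dev/null): keep deprecation warnings such as
# "Warning: Setting configtimeout is deprecated." out of the stream
# that later gets parsed as a number.
puppet agent --onetime --no-daemonize 2>/dev/null

# Idea 2 (grep+awk -> simple awk): with
#   grep 'last_run:' "$SUMMARY" | awk '{print $2}'
# a missing file only fails grep, and the pipeline still exits 0 via awk.
# A single awk call turns the missing file into a hard error.
if ! last_run=$(awk '/last_run:/ {print $2}' "$SUMMARY"); then
    echo "cannot read ${SUMMARY}" >&2
    exit 1
fi
echo "puppet last ran at epoch ${last_run}"
```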
[10:47:40] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3836902 (10Marostegui) s4 is all done but the master. I will alter the master after Christmas.
[10:47:58] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3836904 (10Marostegui)
[10:48:44] jynus: I would like to start deploying ^ alter table on either s1 or s7 slaves, but I am not sure if you are planning stuff on any of those shards :)
[10:48:51] I don't mind which one
[10:50:52] I was touching s1
[10:51:04] but I do not think I will do more of those until next week
[10:51:19] do you have a tracking list?
[10:51:24] yeah
[10:51:38] I mean, not for s7 and s1 (yet), as I was trying to decide
[10:51:43] As soon as I decide, yes :)
[10:51:49] s2, s3, s4 (almost), s6 done?
[10:51:55] indeed!
[10:52:04] so only s7 and s1 left, aside from s5/s8?
[10:52:23] go for s7, unless you are also doing checksums there
[10:52:45] although I would stop s7 checksumming, do s5 instead and go for s7 checksums
[10:52:50] *s7 alters
[10:53:05] I am doing s5 already :)
[10:53:06] just an idea, of course
[10:53:15] (checksumming, I mean)
[10:53:22] don't put too many things on your plate
[10:53:28] I was doing s7 while s5 was being dumped
[10:53:37] but it is already done, so I am stopping the s7 checksumming
[10:53:44] and I am doing dewiki.archive right now
[10:53:51] cool
[10:53:53] as that is the top priority, since it broke
[10:53:57] but I would do some alters in the background
[10:54:54] I think I would start with s1
[10:55:01] so we can have an idea of how long that one will take
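The per-table checking and fixing referenced above (and in T161294) uses Percona Toolkit, per the task title. A hedged sketch of the shape of such a run; the DSN placeholders and whatever wrapping production uses are assumptions, not details taken from this log:

```
# Hedged sketch of a single-table consistency check of the kind referenced
# in T161294; host/user placeholders (upper case) are illustrative only.
pt-table-checksum \
    --databases dewiki --tables archive \
    --replicate percona.checksums \
    --no-check-binlog-format \
    h=S5_MASTER,u=CHECKSUM_USER

# Rows that differ on a replica can then be reviewed with pt-table-sync;
# --print emits the corrective SQL without executing it.
pt-table-sync --print --replicate percona.checksums h=S5_MASTER,u=CHECKSUM_USER
```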
[10:56:37] hey guys
[10:56:44] something important just came up :(
[10:56:50] is it possible to move the meeting to later in the afternoon?
[10:56:58] fine by me
[10:57:02] jaime: ?
[10:57:21] even tomorrow if it is easier
[10:57:27] up to you and jynus
[10:57:38] let's try for today so we can post the goal, but if needed we can do tomorrow, yes
[10:57:45] ok :)
[10:58:23] I would prefer today
[10:58:39] sure, anytime works for me
[10:58:45] what works for you jaime?
[10:58:45] doubts about the goal, based on the greenlight for provisioning
[10:58:55] "greenlight"?
[10:58:57] we can have a proper meeting another time
[10:59:06] mark: there is stuff that happened
[10:59:11] will talk later
[10:59:24] ...when?
[10:59:49] when can you?
[11:00:07] I think between 14:00 and 16:00 is best
[11:00:08] I can anytime today, ping me when available
[11:00:12] or after 17:00
[11:00:21] I will tell you the specific goal-related issue
[11:00:27] i can't wait :
[11:00:27] and we can have a proper meeting another time
[11:00:28] :)
[11:00:37] let's do it from 15:00-16:00?
[11:00:41] ok
[11:00:42] 15:00 CET?
[11:00:54] yes
[11:00:56] marostegui: this is just in case it is bad for you
[11:01:03] I prefer you to be in it, too
[11:01:15] I will be there :)
[11:01:25] marostegui: should we meet now?
[11:01:37] in case you want me in the meeting too, feel free to ping me ;)
[11:01:50] up to you, too
[11:02:28] jynus: I think we can just meet at 3 then
[11:02:32] at 15:00 I mean
[11:02:32] ok
[11:02:35] ok :)
[11:02:40] unless you want to discuss something
[11:02:59] nope, it was just the unexpected purchases and the plan we discussed on Monday?
[11:03:12] sure, I think we can do it at 3 then with mark :)
[11:03:14] you know what I am talking about?
[11:03:16] perfect
[11:03:19] yep
[11:03:23] you are wonderfully mysterious jaime :)
[11:03:31] haha
[11:03:42] it is too long to summarize, I will let manuel explain it
[11:03:47] I think I will go for a run now then and get some exercise before lunch
[11:03:50] jynus: haha thanks
[11:05:36] mark: basically this: https://phabricator.wikimedia.org/T179191
[11:06:25] 6 servers
[11:06:36] 3 x identical pairs
[11:07:15] I was letting you explain it at the meeting
[11:08:15] Yeah, I was giving some context so he knows what we are talking about
[11:08:17] we can discuss it in the meeting
[11:08:39] 2x for sanitarium
[11:08:40] 2x for MCR testing
[11:08:40] and 2x for external storage backups
[11:08:40] yup
[11:36:12] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3837011 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` and were **ALL** successful.
[11:53:59] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3837050 (10Marostegui)
[12:19:14] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3837100 (10Marostegui) I have fixed dewiki.archive across the board. Next on the list: dewiki.change_tag dewiki.geo_tags dewiki.old_image dewiki.tag_summary Once those are done, I will do another checksum to verify.
[12:20:03] were there many rows on dewiki.archive?
[12:20:18] I suppose not many on those servers
[12:20:22] No, not a lot of them
[12:20:37] you haven't reached the problematic ones, I think
[12:21:39] We will see in the next iterations..
[12:26:04] I will leave db1, es1, pc1 and labsdb1 with puppet disabled for a while
[12:26:21] to try to see changes or breakages on codfw first
[12:26:31] noop or expected changes so far
[12:26:49] sounds good :)
[12:26:59] but ping me for individual servers
[12:27:06] will do :)
[12:44:59] db1104 failed, is that something you are handling?
[12:45:09] mmm
[12:45:10] checking
[12:45:26] ah
[12:45:30] it is ok
[12:45:30] puppet you mean?
[12:45:37] reenabling it
[12:45:42] yeah
[12:45:42] that is ok
[12:45:47] just in case it was me
[12:45:49] I stopped it in the morning :)
[12:45:56] leaving it alone
[12:45:59] no no
[12:46:02] you can enable it if you like
[12:46:41] not needed?
[12:46:46] not anymore :)
[12:47:36] maybe we should disable tls in my.cnf and make mysql an alias that enforces it only on remote connections?
[12:47:44] if that is the problem
[12:48:03] we can do that, it was just for mydumper
[12:48:11] but we do normally use --skip-ssl on localhost
[12:48:15] so it might not be a good idea after all
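The idea floated at 12:47 (plain local connections, TLS enforced only for remote ones) could look roughly like the sketch below. The alias name is invented for illustration, and the exact enforcement semantics of the ssl flags vary between MariaDB client versions:

```
# Sketch of "skip TLS locally, enforce it remotely". Assumes my.cnf (or
# ~/.my.cnf) carries skip-ssl in its [client] section; the alias name is
# made up for this example.
alias mysql-remote='mysql --ssl --ssl-verify-server-cert'

# local connections stay plain and avoid the TLS handshake overhead:
mysql --socket=/run/mysqld/mysqld.sock -e 'SELECT 1'

# remote connections go through the alias and must present a verifiable cert:
mysql-remote --host=db2070.codfw.wmnet -e 'SELECT 1'
```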
[12:50:28] 10DBA, 10Epic, 10Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#3837177 (10jcrespo) 05Open>03Resolved a:03jcrespo Finally https://gerrit.wikimedia.org/r/394541 was deployed- but there are still many things pending to refactor.
[12:55:56] jynus: https://gerrit.wikimedia.org/r/#/c/398249/ ok with you?
[12:57:32] yes, although be careful with large enwiki depools
[12:57:58] sometimes they generate connection errors, keep kibana handy
[12:58:08] yup
[12:59:19] there are still queries ongoing on labsdb1011
[12:59:43] do you know if haproxy failovers are hard or soft?
[12:59:52] I would assume soft
[13:00:04] how long have they been running for now?
[13:00:12] apparently 10 hours
[13:00:20] lovely
[13:00:22] but there are some with 0 connections
[13:00:37] I am maybe going to take the time to have lunch, and see what I do later
[13:00:44] ok
[13:00:59] I do see connections on 1010
[13:01:00] in theory the policy is "on connection error, retry once"
[13:01:02] so the failover was fine
[13:01:09] oh, I do not doubt that
[13:01:18] Yeah, but double checking just in case
[13:01:21] just wondering if I should restart it yet, or wait, or something else
[13:01:46] maybe check the clients, maybe they already failed there?
[13:16:54] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3837301 (10Marostegui) Progress on s1 will be tracked here: [] labsdb1003.eqiad.wmnet (replication stopped - will not get it T142807...
[13:17:11] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3837303 (10Marostegui)
[13:47:00] 10DBA, 10Epic, 10Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#3837442 (10Marostegui) 05Resolved>03Open Looks like @chasemp found some issues and this patch might be the cause - on labservices1001: ``` Error: Could not retrieve catalog f...
[15:09:23] 10DBA, 10Epic, 10Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#3837658 (10jcrespo) This should be fully fixed now? No cloud issues?
[15:10:46] chasemp: ^
[15:11:16] marostegui: I don't see wikibugs
[15:11:39] wikibugs?
[15:11:47] is that a service?
[15:11:55] oh, I assumed it was a ^ to the irc bot wikibugs
[15:12:09] oh, sorry
[15:12:25] 10DBA, 10Epic, 10Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#3837673 (10chasemp) >>! In T150850#3837658, @jcrespo wrote: > This should be fully fixed now? No cloud issues? Everything seems good as of now :)
[15:13:47] 10DBA, 10Epic, 10Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#3837680 (10jcrespo) 05Open>03Resolved Resolved, we can reopen if we see something broken.
[15:23:02] https://tendril.wikimedia.org/activity?root=0&dump=0&wikiadmin=0&wikiuser=0&research=0
[15:23:16] I am going to reboot labsdb1011
[15:29:13] cool
[15:29:49] I suppose the right procedure on a multi-instance upgrade is to
[15:30:05] first install the package, then restart instance by instance?
[15:31:16] stop the instances before?
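The procedure being asked about above (upgrade the package first, then cycle one instance at a time) might be sketched as below; the mariadb@<section> unit names, socket paths and package name are assumptions about the multi-instance layout, not details confirmed by this log:

```
# Hedged sketch of "install the package, then restart instance by instance".
# Unit names, socket paths and the package name are illustrative.
set -e
apt-get install -y wmf-mariadb101     # upgrade the binaries first

for section in s1 s2 s3; do
    systemctl restart "mariadb@${section}"
    # wait until the instance answers again before moving to the next one
    # (a naive placeholder for a real health/replication-lag check):
    until mysql --socket="/run/mysqld/mysqld.${section}.sock" -e 'SELECT 1' >/dev/null 2>&1; do
        sleep 5
    done
done
```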
[15:31:56] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3837736 (10Marostegui) The following tables have been fixed on dewiki: dewiki.change_tag dewiki.geo_tags
[15:34:11] <3 you people, what a brave new world w/o labsdb100[123]
[16:30:35] 10DBA, 10Operations, 10ops-eqiad: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (10Marostegui) p:05Triage>03Normal
[16:58:03] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294#3837968 (10Marostegui)
[17:24:52] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3838029 (10bd808) >>! In T142807#3836763, @jcrespo wrote: > @bd808 could we change the old servers...
[17:25:32] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294#3838031 (10Marostegui) I have fixed tag_summary too. dewiki should be consistent now on s5 and s8 - but I will do some more checks tomorrow before starting with wikidatawiki.
[18:01:15] 10DBA, 10Data-Services, 10Toolforge: I can't connect to DB replica on Toolforge due to TLS-related failure - https://phabricator.wikimedia.org/T182892#3838192 (10bd808)
[18:04:56] Do we know if there are any TLS issues connecting to the labsdb10{09,10,11} servers? T182892 sounds like something that might be caused by the lb/proxy layer
[18:04:57] T182892: I can't connect to DB replica on Toolforge due to TLS-related failure - https://phabricator.wikimedia.org/T182892
[18:05:25] chasemp: did you mention something about that the other day?
[18:06:15] I didn't; there was something about sqlproxy (name?) and ssl weirdness that jynus mentioned, but we don't use that for the wikireplicas
[18:06:24] ah. ok
[18:06:43] to my knowledge ssl termination wouldn't work, as it's not offered? I'm confused about how that ever functioned atm
[18:14:00] 10DBA, 10Data-Services, 10Toolforge: I can't connect to DB replica on Toolforge due to TLS-related failure - https://phabricator.wikimedia.org/T182892#3837792 (10bd808) I'm not certain how TLS connections to the servers would have ever worked. I wonder if there is some signal that is coming from the new db c...
[19:16:01] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3838337 (10RobH)
[21:55:25] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3838724 (10aaron) I keep coming up with times like: ``` Same-DC (db2070.codfw.wmnet): string(56) "0.10926739454269 sec/conn (non...
[23:06:54] 10DBA, 10Data-Services, 10Toolforge: I can't connect to DB replica on Toolforge due to TLS-related failure - https://phabricator.wikimedia.org/T182892#3838858 (10MaxBioHazard) 05Open>03Resolved a:03MaxBioHazard It works, thanks.
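A closing note on the T182892 thread: a quick way to tell whether a MySQL endpoint actually negotiates TLS is to connect with the ssl flags on and inspect the session. The service host name below follows the new Wiki Replicas naming; treat the exact flags as version-dependent and the snippet as a diagnostic sketch only:

```
# Diagnostic sketch for a "does this endpoint speak TLS?" check, as in the
# T182892 discussion. Credentials are omitted; enforcement semantics of
# --ssl vary by client version.
if mysql --host=enwiki.analytics.db.svc.eqiad.wmflabs \
         --ssl --ssl-verify-server-cert \
         -e "SHOW STATUS LIKE 'Ssl_cipher'"; then
    echo "connected; an empty Ssl_cipher value means no TLS was negotiated"
else
    echo "connection failed (possibly a TLS handshake error)" >&2
fi
```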