[08:36:57] 10DBA, 10Patch-For-Review: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294#3847288 (10Marostegui) I am fairly confident that dewiki is now really consistent. Will keep doing a bit more tests to check it out. Still fixing and checking wikidata
[08:38:16] ./compare.py db2071 db2088:3311 enwiki text old_id --from-value=0 --to-value=1000 -> No differences found.
[08:38:56] Ah cool :)
[08:38:57] thanks
[08:39:33] check the latest version from: https://gerrit.wikimedia.org/r/345188
[08:55:27] hello! I did some refactoring of the eventlogging puppet master/slave code in https://gerrit.wikimedia.org/r/#/c/398869. This also includes the eventlogging_cleaner.py script on the eventlogging master to sanitize its data.
[08:55:49] The only thing that it is missing is probably the user on the m4 master db
[08:56:13] the sanitization script has been working fine on db1108 for a month
[08:56:41] the idea is to start manually and sanitize data up to say 4 months ago
[08:56:50] and then enable the cron
[08:59:15] elukey: I don't think you need our review for that
[08:59:23] you only need to know how to use myloader!
[08:59:41] :D
[08:59:57] well I always like to get a more expert opinion before doing things on databases
[09:00:02] https://github.com/maxbube/mydumper/blob/master/docs/myloader_usage.rst
[09:01:54] ah I'd also need to add the user to mariadb
[09:03:29] checking
[09:05:42] jynus: "Quick & dirty script" was clear enough ;)
[09:08:14] jynus,marostegui: I'd just apply the eventlogcleaner's grant on production-m4.sql.erb
[09:08:21] (on db1107)
[09:08:33] You want to join the DBA team?
[09:08:35] We need people!
[09:10:28] you need people breaking databases and causing more pages? :D
[09:14:26] I was playing around with 8.0 yesterday
[09:14:45] https://gerrit.wikimedia.org/r/399115
[09:14:55] https://gerrit.wikimedia.org/r/399113
[09:15:37] ooooh nice!! :)
[09:58:36] I am happy now with https://puppet-compiler.wmflabs.org/compiler02/9404/
[09:58:54] it works now?
[09:58:58] the compiler I mean
[09:59:01] yes
[09:59:12] if you can check the hiera keys at some point
[09:59:30] which is the thing that the compiler cannot check & easy to mess up
[09:59:57] doing it now
[10:00:17] sorry for not doing it earlier, I am doing archeology and I put my chat in the background to avoid distractions
[10:00:48] the idea is to remove the first 2 lines, host by host
[10:01:34] I will also deploy host by host, and puppet doesn't reload the config automatically
[10:07:22] marostegui: no problems found?
[10:07:42] that is strange, I normally have lots of mistakes
[10:07:49] No, all the hiera files look good
[10:07:59] I have checked the master/backup hosts and their IPs
[10:08:05] with the ones in production now
[10:08:07] cool, thanks
[10:08:35] I will deploy after noise comes down
[10:08:43] remember once that is deployed
[10:09:07] https://gerrit.wikimedia.org/r/#/c/398508/ should be abandoned/converted to hiera
[10:52:42] I wonder if I should directly just reimage the proxies?
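
A note on the compare.py run above (08:38): the idea is to hash a primary-key range of one table on each of two hosts and flag chunks that do not match. Below is a minimal sketch of that idea, not the actual WMF script; it assumes pymysql, a ~/.my.cnf holding credentials with SELECT on both hosts, and the default port 3306 for the first host.

```python
#!/usr/bin/env python3
"""Minimal sketch of a compare.py-style consistency check: hash a
primary-key range of one table on two hosts and report chunks that differ.
This is NOT the actual WMF script; pymysql, a ~/.my.cnf with SELECT grants
on both hosts, and the port for the first host are all assumptions."""
import hashlib

import pymysql


def chunk_digest(conn, table, pk, lo, hi):
    """Digest of every row with pk in [lo, hi), in primary-key order."""
    h = hashlib.sha256()
    with conn.cursor() as cur:
        # Identifiers cannot be bound parameters, hence the f-string; this
        # is fine for a sketch with trusted, hard-coded names.
        cur.execute(
            f"SELECT * FROM {table} WHERE {pk} >= %s AND {pk} < %s ORDER BY {pk}",
            (lo, hi),
        )
        for row in cur:
            h.update(repr(row).encode())
    return h.hexdigest()


def compare(host_a, host_b, db, table, pk, lo, hi, step=100):
    """Return the [start, end) chunks whose contents differ between hosts."""
    conn_a = pymysql.connect(host=host_a[0], port=host_a[1], db=db,
                             read_default_file="~/.my.cnf")
    conn_b = pymysql.connect(host=host_b[0], port=host_b[1], db=db,
                             read_default_file="~/.my.cnf")
    diffs = []
    for start in range(lo, hi, step):
        end = min(start + step, hi)
        a = chunk_digest(conn_a, table, pk, start, end)
        b = chunk_digest(conn_b, table, pk, start, end)
        if a != b:
            diffs.append((start, end))
    return diffs


if __name__ == "__main__":
    # Roughly the range checked above: text.old_id from 0 to 1000 on enwiki.
    bad = compare(("db2071", 3306), ("db2088", 3311),
                  "enwiki", "text", "old_id", 0, 1000)
    print("No differences found." if not bad else f"Differing chunks: {bad}")
```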
[10:53:15] the reimage issue looks solved, at least db1112 worked fine
[10:53:26] well, I would reimage them to stretch
[10:53:31] ah :)
[12:02:39] another refactoring to enable the firewall on proxies: https://gerrit.wikimedia.org/r/399164
[14:11:49] 10DBA, 10Operations: Reimage and upgrade to stretch all proxies - https://phabricator.wikimedia.org/T183249#3848018 (10jcrespo) p:05Triage>03Normal
[14:12:17] 10DBA, 10Operations, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3848045 (10jcrespo)
[14:12:19] 10DBA, 10Operations: Reimage and upgrade to stretch all proxies - https://phabricator.wikimedia.org/T183249#3848043 (10jcrespo)
[14:14:54] any more tickets you think we can add to https://gerrit.wikimedia.org/r/399164 ?
[14:26:56] 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848099 (10jcrespo)
[14:37:04] Hello DBAs, could I get a quick review for https://gerrit.wikimedia.org/r/399182 ? I'm working on fixing the broken DB situation in beta labs
[14:37:16] And that commit depools the broken replica
[14:37:33] 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1001.eqiad.wmnet'] ``` The log can be found in `/var/...
[14:43:44] RoanKattouw: I can +1 it as far as "this is not affecting production", but I have no idea how beta is set up
[14:44:11] I also got addshore to volunteer to merge it; now waiting for Jenkins to not -1 me for not spacing my comments correctly
[14:44:35] Basically it's just one master and one replica, and replication broke badly leaving both DBs in an inconsistent state which I'm fixing now
[14:44:36] that certainly looks like how we depool servers on production
[14:44:44] :)
[14:45:09] Yeah I used that as an example
[14:45:41] Fortuitously, the most recent commit in the config repo was a depooling :)
[14:45:55] beta should be more similar to production
[14:46:06] there is no reason why it does its own thing
[14:46:48] different packages, configuration, not even similar topology
[14:47:16] we do read only slightly differently
[14:47:37] Oh the read-only thing I improvised
[14:47:38] but maybe it is just a production particularity
[14:47:41] I don't know how we do that in prod
[14:47:45] see
[14:47:53] well, for starters
[14:47:58] Or in beta for that matter, I just did something that I knew would work
[14:48:08] read only on mediawiki doesn't really disable writes
[14:48:14] "beta should be more similar to production" I agree
[14:48:35] Re beta needing to be more similar to prod, here's a fun one for ya: T183245
[14:48:37] T183245: Ensure replica DB in labs is read-only - https://phabricator.wikimedia.org/T183245
[14:48:41] so you have to disable writes on the database too, it avoids drift
[14:48:46] The replica DB server was (and is!) writable
[14:48:56] That's one of the causes of this mess
[14:49:07] no, what I mean is that, even if it was in read only, processes keep writing
[14:49:23] that is, we cannot trust mediawiki
[14:49:23] I know $wgReadOnly doesn't fully stop writes, but I have faith that it'll stop writes to the text and revision table
[14:49:26] *tables
[14:49:29] Those are the only ones I care about
[14:49:33] ok
[14:49:42] The text table was the only one that was being written to on the replica
[14:49:42] Why don't we actually just have beta as a beta version of prod?
[14:49:46] note I am not complaining, just warning :-)
[14:49:49] drives me mad
[14:50:39] addshore: I can tell you why, but you are not going to like the answer
[14:50:40] addshore: That would be nice although it would probably take more resources, if you want to have ~900 wikis across 7 DB clusters with multiple replicas each
[14:50:59] Or you have to simplify the DB setup but then you compromise on "similar to production"
[14:51:16] we do not have to have 200 hosts
[14:51:34] but having a single master-replica is oversimplistic
[14:51:49] and with so little content
[14:51:59] so even the simpler regressions cannot be caught
[14:52:09] There are so many issues that I have seen occur over the past year, costing countless hours of developer and staff time that could have been caught by a 'real' beta
[14:52:22] You'd still end up with ~21 (7x3) DB hosts though, which is significantly less than 200 but also significantly more than the current 2
[14:52:47] most of those can be provisioned automatically with fake data
[14:53:12] and be on containers/vms that do not really take so much resources
[14:53:25] yeh, I feel like progress toward containers might help us here
[14:54:04] what happens is that the DBAs already take care of production, analytics, and labs, and the last thing they want is to set up another production
[14:54:28] especially on beta, which should be open enough for everybody to experiment
[14:55:30] thanks for taking care of that, RoanKattouw
[14:55:56] I've now fixed the inconsistent state on the master on deploymentwiki and enwiki, hoping that the other wikis will have very little or zero cleanup work
[14:56:11] ah!
[14:56:17] Now that my change is deployed I can test this (it was reading from the replica before)
[14:56:22] given the multiple production inconsistencies
[14:56:45] I would expect way more breakage on beta
[14:56:49] aah RoanKattouw, the change is merged already?
[14:57:06] Yes, after jynus +1ed and we got lost in conversation I self-+2ed
[14:57:37] coolio! :)
[14:57:38] jynus: Well, this breakage was caused because of 1) bad code, 2) T183242 and 3) T183245
[14:57:38] T183242: DB handles obtained with DB_REPLICA should not allow writes - https://phabricator.wikimedia.org/T183242
[14:57:38] T183245: Ensure replica DB in labs is read-only - https://phabricator.wikimedia.org/T183245
[14:57:58] Basically someone wrote code that looked like $dbw = wfGetDB( DB_REPLICA ); .... $dbw->insert( ... );
[14:58:22] MediaWiki didn't stop itself from writing to a replica DB, and the replica was not set to read-only
[14:58:33] at least it didn't reach production
[14:58:47] And the insert was for a table with auto_increment IDs, so we ran into ID collisions pretty quickly, breaking replication
[14:58:51] Yeah well
[14:58:53] but production slaves are read_only=on, so we should have been safe, no?
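
The question above (14:58) about production replicas running with read_only=ON is the sort of thing a small audit script can confirm. A minimal sketch, assuming pymysql, a ~/.my.cnf for credentials, and a hypothetical host list rather than the real shard inventory:

```python
#!/usr/bin/env python3
"""Minimal sketch of a read_only audit, assuming pymysql and a ~/.my.cnf
with an account that can connect to the replicas. The host list below is a
hypothetical example, not the real shard inventory."""
import pymysql

REPLICAS = [
    ("db1051.eqiad.wmnet", 3306),  # placeholder hosts for illustration
    ("db1055.eqiad.wmnet", 3306),
]


def is_read_only(host, port):
    """True if the server refuses writes from non-SUPER accounts."""
    conn = pymysql.connect(host=host, port=port,
                           read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.read_only")
            (value,) = cur.fetchone()
            return bool(value)
    finally:
        conn.close()


if __name__ == "__main__":
    writable = [h for h in REPLICAS if not is_read_only(*h)]
    if writable:
        # A writable replica is exactly the beta failure mode described
        # above: stray local writes consume auto_increment IDs that the
        # master later hands out too, and replication stops on the
        # resulting duplicate-key error.
        print("WRITABLE replicas found:", writable)
    else:
        print("All replicas have read_only=ON")
```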
[14:58:57] At least in production the replicas are read-only
[14:59:00] I verified that
[14:59:14] yeah, we take care of that, and they always come back read only
[14:59:23] even if they crash
[14:59:33] https://www.irccloud.com/pastebin/Z9aebPrE/
[14:59:47] RoanKattouw: could you +2 https://gerrit.wikimedia.org/r/#/c/399188/ ? (I took it out of daniel's large patch)
[14:59:48] temporary readonly is better than data loss/inconsistencies
[15:00:26] addshore: Thanks, I didn't realize we hadn't fixed the cause yet
[15:00:45] RoanKattouw: well, the revert meant that nothing is calling the code that caused the issue any more
[15:00:52] Oh and that's merged already?
[15:00:55] yup
[15:01:30] OK cool
[15:02:46] I'm going to set up my docker based dev environment to have a replica now.... as then I would have caught this....
[15:04:46] Also Daniel is working on making MW fail when you try to do this, right?
[15:05:21] yup
[15:05:29] OK, enwiki and deploymentwiki are in a consistent state now
[15:05:46] I tried to promote myself to sysop on deploymentwiki so I could delete a spam page and block the spammer, but guess what, it's still read-only :D
[15:05:54] haha
[15:06:02] I'm going to go outside for a quick walk now before it gets dark (sunset is in 20 mins)
[15:06:22] But when I'm back I'll check the rest of the wikis, and once I've fixed them too (hopefully there won't be anything to fix), I'll bring labs out of read-only mode
[15:07:24] RoanKattouw: you back in .nl?
[15:07:59] Yeah
[15:08:10] I miss .nl at Christmas, it is nice :-)
[15:08:25] And apart from the weekend when it was sunny, the weather has been kinda sad and overcast
[15:08:39] There is still some snow on the ground though from last week
[15:08:40] usual weather you mean?
[15:08:54] Yeah, the weather I fled from :)
[15:08:59] hahaha
[15:09:10] I got some photos from friends and there was some pretty serious snow there a few days ago
[15:09:17] when I lived there it never snowed that much!
[15:09:46] I only remember one snowfall like that, in 2005
[15:09:49] 30cm overnight
[15:10:09] wow…
[15:10:10] Last week they got 15cm two nights in a row which is also pretty crazy
[15:10:24] Yeah, the photos I got sent were pretty impressive
[15:10:27] I am glad I was not there :-)
[15:10:34] Enjoy your walk, doei! :)
[15:30:03] there is a bug on some of the patches I deployed, the master.cfg file is gone
[15:30:31] what?
[15:34:58] profile::mariadb::proxy::master doesn't seem to be applied
[15:36:57] oh, I see why
[15:43:30] https://gerrit.wikimedia.org/r/399200
[15:45:37] I am happy we (you) are finally giving some love to proxies :)
[15:45:39] they really needed it
[15:45:51] so I set the proxy role
[15:45:57] not the proxy::master
[15:46:07] but because it was not applying some changes
[15:46:30] we didn't realize they were missing
[15:46:39] as extra config files were not deleted
[15:46:59] but I noticed it in the reimage, that there were some missing files
[15:47:36] now it works as intended: /Stage[main]/Profile::Mariadb::Proxy::Master/File[/etc/haproxy/conf.d/db-master.cfg]/ensure
[15:48:19] echo "show stat" | socat /run/haproxy/haproxy.sock stdio
[15:48:26] mariadb,db1016,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
[15:48:55] iptables -L Chain INPUT (policy DROP)
[15:49:19] take a look at the proxies, they are way cleaner now
[15:49:55] so role::mariadb::proxy::master includes profile::proxy and profile::proxy::master
[15:50:18] profile::proxy installs and does the general haproxy setup, with no rules
[15:50:32] and proxy::master sets the host to failover between 2 hosts set up in hiera
[15:50:59] if we want to change the primary or secondary host, we only have to edit the right hiera keys
[15:53:27] yeah, that logic makes a lot more sense than having it on site.pp
[16:12:34] marostegui: one last thing before you go, labsdb1009 status?
[16:12:53] should I do schema change there, etc.?
[16:29:03] You can do anything you like to labsdb1009
[16:29:04] it is all done
[16:36:40] hello people, as FYI I've started the sanitization script on db1107 (log database)
[16:37:00] how long do you expect it to run for?
[16:38:33] say 3/4 days
[16:38:57] 10k rows per batch, 2s of delay between batches
[16:39:15] ah cool
[16:39:57] it shouldn't be too aggressive, will keep checking metrics
[17:39:43] 10DBA, 10Operations, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3848914 (10jcrespo) Firewall has been enabled on all proxies except the active ones: ``` dbproxy1002.yaml:profile::mariadb::proxy::firewall: 'disabled' dbproxy1003.yaml:profile::ma...
[17:46:37] 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848937 (10jcrespo) dbproxy1001 has been successfully reimaged, which joins the already upgraded to stretch dbproxy1004 and dbproxy1009 (although these one have to yet be reconfig...
[17:56:11] RoanKattouw: Re: "Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow)"
[17:56:39] RoanKattouw: https://phabricator.wikimedia.org/T180918
[17:56:42] jynus: We long ago figured out what the issue was
[17:56:51] welcome to my nightmare :-)
[17:57:08] But I was not aware of that task, I'll read it
[17:57:39] just commenting on that particular opinion, not the whole conversation :-)
[17:57:44] If you're interested in what the labs/MCR problem turned out to be, see the IRC logs posted into T183252
[17:57:44] T183252: Unbreak replication in beta labs - https://phabricator.wikimedia.org/T183252
[17:57:57] RoanKattouw: I was reading that
[17:58:06] jynus: Haha yeah fair. I later figured out that the issue was the exact opposite of what I thought
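
The sanitization run described above (16:36-16:39) uses a common throttled-purge pattern: work in small batches and pause between them, so neither the master nor its replicas ever have to absorb one huge transaction. A minimal sketch of that pattern, assuming pymysql; the table name, timestamp column and retention window are illustrative and not the real eventlogging_cleaner.py configuration:

```python
#!/usr/bin/env python3
"""Minimal sketch of the throttled, batched purge pattern described above
(10k rows per batch, a short sleep in between). Table name, timestamp
column and retention window are illustrative, not the real
eventlogging_cleaner.py configuration; pymysql is assumed."""
import time

import pymysql

BATCH_SIZE = 10000     # rows per batch, as in the run described above
SLEEP_SECONDS = 2      # pause between batches so replicas can keep up
RETENTION_DAYS = 120   # keep roughly the last ~4 months


def purge_old_rows(conn, table="SomeSchema_12345", ts_column="ts"):
    """Delete rows older than the retention window, BATCH_SIZE at a time."""
    total = 0
    while True:
        with conn.cursor() as cur:
            # Identifiers cannot be bound parameters; acceptable here since
            # the names are hard-coded in this sketch.
            cur.execute(
                f"DELETE FROM {table} "
                f"WHERE {ts_column} < NOW() - INTERVAL %s DAY "
                f"LIMIT %s",
                (RETENTION_DAYS, BATCH_SIZE),
            )
            deleted = cur.rowcount
        conn.commit()  # keep each transaction small
        total += deleted
        if deleted < BATCH_SIZE:
            return total  # last (partial) batch: nothing much left to purge
        time.sleep(SLEEP_SECONDS)  # let replication and other traffic breathe


if __name__ == "__main__":
    conn = pymysql.connect(host="db1107.eqiad.wmnet", db="log",
                           read_default_file="~/.my.cnf")
    print("purged", purge_old_rows(conn), "rows")
```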
[17:58:16] It was because we were writing to a replica that allowed writes
[17:58:31] but that is still a valid thought
[17:58:31] So ChronologyProtector actually *caused* the bug, by doing its job correctly
[17:58:45] outside of the context
[18:00:43] Yeah
[18:54:17] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849130 (10Cmjohnson) There were 2 failed disks. Replaced both and they're rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online,...
[18:57:42] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849134 (10Marostegui) Thank you!
[19:34:14] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849174 (10alanajjar)
[19:36:14] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849186 (10Marostegui) Can you do it tomorrow european morning?
[19:36:46] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849187 (10alanajjar) >>! In T183285#3849186, @Marostegui wrote: > Can you do it tomorrow european morning? Yes, of course (Y)
[19:38:07] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849197 (10Marostegui) Let's do it at 9 UTC maybe?
[19:40:15] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849218 (10alanajjar) >>! In T183285#3849197, @Marostegui wrote: > Let's do it at 9 UTC maybe? I'll try
[19:41:26] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849220 (10Marostegui) Cool - I will be online anyways since 6UTC, just ping me whenever you want in the morning.
[19:48:55] 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3849233 (10alanajjar) @Marostegui Deal ^^
[19:54:54] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T182853#3849238 (10Marostegui) 05Open>03Resolved All good! Thanks ``` root@db1059:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID...
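
For background on the ChronologyProtector exchange above (17:56-18:00): it gives users "read your own writes" by, in essence, remembering the master's replication position at the end of a request that wrote and making later reads wait until the chosen replica has replayed past that position, which is why a replica with broken replication (or with stray local writes) trips it up. Below is a minimal sketch of the underlying wait, assuming pymysql and classic binlog file/position replication; it is not MediaWiki's implementation and the hostnames are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch of the idea behind ChronologyProtector discussed above,
NOT MediaWiki's implementation: remember the master's binlog position at
the end of a request that wrote, and before reading on a replica wait
until it has replayed past that position. Assumes pymysql and classic
file/position replication; hostnames are placeholders."""
import pymysql


def master_position(master_conn):
    """Binlog file and offset on the master right after our writes."""
    with master_conn.cursor() as cur:
        cur.execute("SHOW MASTER STATUS")
        row = cur.fetchone()
        return row[0], row[1]  # (File, Position)


def wait_for_position(replica_conn, binlog_file, position, timeout=10):
    """Block until the replica has executed past (file, position).

    MASTER_POS_WAIT returns -1 on timeout and NULL when replication is not
    running at all, which is why a fully broken replica makes this kind of
    "read your own writes" wait fail instead of silently serving stale data.
    """
    with replica_conn.cursor() as cur:
        cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                    (binlog_file, position, timeout))
        (result,) = cur.fetchone()
        return result is not None and result >= 0


if __name__ == "__main__":
    master = pymysql.connect(host="master.example", read_default_file="~/.my.cnf")
    replica = pymysql.connect(host="replica.example", read_default_file="~/.my.cnf")
    binlog_file, position = master_position(master)
    if wait_for_position(replica, binlog_file, position):
        print("replica caught up; safe to read our own writes from it")
    else:
        print("replica lagged or replication broken; fall back to the master")
```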
[20:17:40] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3849296 (10Cmjohnson) Disks are wiped
[20:18:26] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3849300 (10Cmjohnson)
[20:18:45] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3849302 (10Cmjohnson)
[20:19:00] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3849305 (10Cmjohnson)
[20:19:18] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3849306 (10Cmjohnson)
[20:19:38] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3849307 (10Cmjohnson)
[20:19:54] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3849308 (10Cmjohnson)
[20:20:17] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3849310 (10Cmjohnson)
[20:20:20] 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3849311 (10greg)
[20:38:48] 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3849376 (10Marostegui) Sorry - I was playing with the subscribers and removed the DBA tag by mistake.
[20:42:21] 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3849379 (10Marostegui) I don't think I will be able to help with this, as I am focused on fixing data dr...
[20:44:53] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3849389 (10Marostegui) 05stalled>03Open