[05:18:50] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) 05Open>03stalled As per T204006#4613570
[07:14:55] where can I check if the backups are completed on dbstore2002?
[07:15:03] (good morning btw)
[07:15:04] Hi!
[07:15:22] jynus and I just saw that we have s2 duplicated on both dbstore2001 and dbstore2002
[07:15:28] with dbstore2001 being the active source
[07:17:06] 👍
[07:17:27] banyek: that probably means we can remove the data from dbstore2002 entirely
[07:17:36] OH
[07:17:49] I am not so sure about that
[07:17:56] tendril down?
[07:17:56] I would keep the one on dbstore2002
[07:18:08] tendril seems down to me indeed
[07:18:12] tendril works for me
[07:18:23] we just got paged
[07:18:49] is db1115 used for tendril?
[07:19:01] I'll look after this
[07:20:26] banyek: check the -operations channel
[07:20:28] we are debugging there
[07:20:30] there is plenty of space, but you know that already - I see you logged in :D
[07:50:10] 10DBA: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui)
[07:51:19] something happened at 7:16 on tendril
[07:51:53] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&from=now-6h&to=now-1m
[07:52:36] Maybe db1092 being unavailable triggered something on tendril? some sort of overload?
[07:53:33] it could be many things
[07:53:35] 10DBA: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui) On reboot: ``` 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists. ```
[07:53:39] when did db1092 go down?
[07:54:07] at 6:10
[07:54:07] as per its HW logs, the failure is at 06:05
[07:54:24] 6:11:32 was the last prometheus report
[07:54:58] could be some build-up, sure, but that is true with almost anything - memory leak, etc.
[07:55:14] but nprd didn't work at all, apparently
[07:56:07] it started alerting at 7:17
[07:57:08] didn't banyek want to practice cloning? he will be able to do it with db1092 :-D
[07:57:17] indeed :)
[07:57:18] cool then
[07:57:24] I am upgrading the kernel and mariadb and giving it a last reboot
[07:57:26] although let's wait for
[07:57:31] full debugging
[07:57:36] We'll start to upgrade db1107 now, after that I can reclone it
[07:57:48] no rush for db1092
[07:57:55] although let's depool it
[07:58:03] As long as we reclone it before the failback, I am fine
[07:58:08] jynus: I am doing the patch now
[07:58:20] I will disable alerts
[07:58:51] FYI: my daughter is sick again - today my wife is here, but possibly tomorrow and Friday I have to deal with her
[07:59:03] at least until 2 or something
[07:59:09] I'll try to figure out something
[08:06:24] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui) p:05Triage>03Normal a:03Cmjohnson @Cmjohnson looks like we need a new BBU. This host is under warranty, can you talk to HP and see if we can get a new BBU before 10th Oct (a...
[08:15:02] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Campsite, 10User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui)
[08:15:31] marostegui: can you see why I disable lagging alerts on dbstores? they keep spamming -operations
[08:16:19] indeed :)
[08:16:23] I didn't enable them eh!
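For context on the lag alerts being silenced above, here is a minimal sketch (not the exact procedure the team uses) of checking by hand how far behind a dbstore replica is. The host name is only illustrative, and since dbstore2001/dbstore2002 replicate several sections into one instance, MariaDB's multi-source form of the command is assumed:
```
# Minimal sketch, assuming shell/mysql access to the replica; the host name is an example.
# dbstore2001/2002 replicate multiple sections, so the MariaDB multi-source variant is used.
mysql -h dbstore2002.codfw.wmnet -e "SHOW ALL SLAVES STATUS\G" \
  | grep -E 'Connection_name|Seconds_Behind_Master|Slave_(IO|SQL)_Running'
```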
[08:19:16] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui)
[08:19:25] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui)
[08:21:07] 10DBA: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) Thanks @Papaul! So my proposal is to do: T205257#4610104 @jcrespo @Banyek thoughts on that?
[08:21:26] 10DBA, 10User-Banyek: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 (10Banyek) db1107 upgrade done, the operation is resumed
[08:22:35] 10DBA, 10User-Banyek: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 (10Marostegui) Great! Make sure to update: https://phabricator.wikimedia.org/P7510 and reload haproxy!
[08:22:42] 10DBA, 10User-Banyek: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 (10Banyek) upgrade of db1108 is planned for 2018-09-27 10:00 CEST
[08:24:28] marostegui: the controller failed, not only the battery
[08:25:03] really? the idrac logs didn't say that, they only mention the battery
[08:25:11] I will show you in a sec
[08:25:15] when they finish downloading
[08:25:20] Great!
[08:25:22] pasting to ticket
[08:25:30] Too bad they have different info on the idrac than on the web interface
[08:28:30] 10DBA, 10User-Banyek: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 (10Banyek) Update for 1108: db1108 Agree on a usable maintenance window Downtime dbproxy1004 and dbproxy1009 services in icinga Disable eventlogging_sync.sh on db1108 Disable eventlogging_cleaner Do the...
[08:31:29] is there any reason why the haproxies don't have the /stats webui enabled?
[08:32:07] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10jcrespo) {F26209361} AHS log on google drive, cannot attach it on phabricator: https://drive.google.com/open?id=1Y9RikXhRlZHY-7gNN0MeK5AdOjswPXJH
[08:33:10] ^ marostegui: I was wrong, the power loss is old
[08:33:21] So only BBU then?
[08:33:27] yep
[08:33:44] I cannot download that file
[08:33:45] and it was like the last time - a disk degrading brought down the whole thing
[08:33:56] which one?
[08:34:02] the one in gdrive
[08:34:12] that is shared with all WMF accounts
[08:34:27] yeah, but gdrive does nothing
[08:34:27] does it fail or say permission denied?
[08:34:33] let me check
[08:34:36] no, it just doesn't download
[08:35:40] try now
[08:35:43] let's see
[08:35:56] it is not as if you are going to read anything
[08:36:07] it keeps scanning for viruses...
[08:36:19] now it worked
[08:36:22] what did you do?
[08:36:51] changed the link
[08:36:55] haha
[08:36:58] It works now
[08:37:49] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui)
[08:42:09] hello :) Time to chat now about dbstore1002 goals, or is an email better?
[08:43:04] elukey: give me 10 mins if you can
[08:45:15] ack!
[08:48:52] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) compressing s2 is resumed
[08:56:58] elukey: I am back!
[08:59:09] banyek: I am not planning to debug db1092 any further, as we have everything we need. Not sure if jynus wants to do something with it, but if not, we can reclone it if you like. I doubt chris will get to it today, as he first needs to deal with HP
[09:00:33] jynus:?
^^
[09:06:15] ok
[09:06:31] but shouldn't we talk about backupd first?
[09:06:37] we should
[09:06:47] I'd prefer to something more important
[09:06:51] *do
[09:07:05] backupd?
[09:07:08] db1092 is going to be there today and tomorrow
[09:07:12] backups
[09:07:24] ah
[09:07:37] I use a qwerty keyboard where s and d are close together
[09:07:40] sure, whatever you guys want
[09:07:47] marostegui: ack! So I had a chat with my team about a possible goal, and we thought that we could split it into two parts. The next Q should be dedicated to getting the hardware, setting it up and possibly (maybe stretch goal?) having the puppet config deployed and data replicated. Then in another quarter the replacement of the research user (with a new way to provide accounts) and the migration from dbstore1
[09:07:53] 002 to the new system. Does it sound acceptable to you guys?
[09:08:06] jynus: No need to explain that, I use the same keyboard. I thought backupd was some sort of daemon or something
[09:08:13] :-)
[09:09:38] elukey: you are not planning to order the hardware during Q2?
[09:09:49] elukey: with next quarter you mean the one starting next week?
[09:10:23] wow, you use the same keyboard as me!? you mean like https://youtu.be/1Y2zo0JN2HE?t=12s ?
[09:11:01] jynus: I hope you use a mechanical one, otherwise I am not talking to you anymore
[09:11:10] marostegui: yes, next quarter I mean Q2, in theory the hardware should already be in the pipeline to be ordered IIUC, procurement tasks etc. are already in place waiting for a quote
[09:12:12] well, mid-Q3 the server goes down, whether we have a replacement or not, you heard faidon
[09:12:20] elukey: Then yeah, getting the hardware and trying to get puppet adapted for dbstore1002-multiinstance sounds like a good idea. If that is done on time, we can help with getting all the data there
[09:12:28] And obviously advise on the puppet stuff
[09:13:24] jynus: yep exactly, but at the same time it seems a lot to set up the replacement and migrate users (with new accounts) in one Q, given the fact that you guys are already super full
[09:13:42] marostegui: I have no idea about what puppet work needs to be done but I can surely help/work on it
[09:13:46] well, we are not in charge of that
[09:13:49] elukey: I would assume your team would take the lead on the HW ordering + puppet?
[09:13:53] analytics is
[09:14:00] we can just help you do it!
[09:15:52] jynus: sure but up to now you guys have managed it, and I will surely not be able to do maintenance on three new databases with my limited knowledge :D. So I agree that technically we own it (as analytics), but practically I'd need some guidance from you about how to set up this work correctly
[09:16:11] elukey: that is fine, we can help with that
[09:16:12] for the 3rd time, we will help
[09:16:18] 4th time now :-)
[09:16:52] and for the 4th time I am telling you the same thing, since you keep stating that "analytics owns it" :)
[09:17:08] so we are kinda repeating the same thing each time on both sides
[09:17:22] elukey: My main question for you (your team) is if you'll lead that. The ordering process (we can help reviewing the quotes), following up on the racking and installation and getting some puppet code up (so we can review and advise)
[09:18:40] marostegui: sure I can definitely lead it
[09:19:40] elukey: That's good. So maybe we need to agree on the scope now. I think getting the hardware ordered + racked + installed sounds like a good idea.
Not completely sure about the data and puppet part
[09:19:45] Maybe that can be the stretch goal
[09:20:04] exactly, I think it is good
[09:20:29] I'll try to push for the stretch anyway since, as Jaime was saying, mid Q3 seems really near :D
[09:20:31] elukey: As you said we're pretty full already + holidays + xmas time
[09:20:58] elukey: Yeah, I am fine with having it as a stretch point
[09:21:45] ok then, I am going to explain what we discussed in here and ask Nuria/Mark to review the goals accordingly
[09:21:46] elukey: My main concern is not adapting the multiinstance role that we already have to dbstore1002, but discovering all the stuff that is hardcoded in dbstore1002 that we might not even be aware of :)
[09:22:16] it will be fun!
[09:22:37] * marostegui has another concept of "fun"
[09:28:28] https://www.youtube.com/watch?v=dMQdqO4RV3w
[09:28:49] lol
[09:32:09] * elukey avoids linking Rebecca Black - Friday in here
[09:32:51] FRIDAY FRIDAY!!!!!!
[09:33:17] http://isitfridayyet.net/
[09:33:20] :(
[09:42:48] I am going to deploy the new root client grants to all databases
[09:42:55] ok
[10:03:47] I thought we changed the labs passwords
[10:04:10] don't they have a different one from core?
[10:04:23] we did
[10:04:32] but I applied the new ones with the same one
[10:04:36] so I have to undo that
[10:04:41] aah right
[10:05:03] and that is why we need a tool
[10:05:27] then there are some hosts that failed that shouldn't have
[10:05:50] sorry. there's also a Juniper change pending, maybe that affected labsdb? https://phabricator.wikimedia.org/T205513
[10:05:53] like db1107
[10:06:02] no, nothing to do, they didn't fail
[10:06:05] ok
[10:06:38] so db1107 doesn't have the proper grants
[10:14:54] I am going to have some lunch
[10:15:18] elukey: I have made the grants uniform on the eventlogging dbs
[10:15:38] they had some insecure defaults
[10:15:43] for root
[10:17:18] ah thanks! Can I know what the grants are, just as an FYI? (curious, but when you have time)
[10:17:44] the old ones or the new ones?
[10:17:57] it had a root@10.x
[10:18:14] I have restricted it to the root clients and to the 2 eventlogging hosts
[10:18:34] and removed the grant option from the eventlogging hosts
[10:19:15] eventlogging sync was temp. halted
[10:19:23] but then I checked and put it back running
[10:20:05] db1107 wasn't available from neodymium/cuminXXXX etc
[10:20:29] more details of the fixes at T177385
[10:20:30] T177385: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385
[10:21:53] marostegui: for fun, run "pt-show-grants | grep root" on db1118
[10:23:18] it doesn't work?
[10:23:25] IT DOES
[10:23:28] too well
[10:23:46] Ah, I was running that exact string you sent
[10:23:50] I may have broken it
[10:23:50] let me get it to connect :)
[10:24:01] I am precisely changing root@localhost
[10:24:06] that same command doesn't work now
[10:24:15] what was the issue?
[10:24:41] it was a very strange format
[10:24:49] I may have killed db1118 access
[10:25:19] Who neds mysql 8.0 anyways!
[10:25:22] *needs
[10:25:22] or it doesn't work with the mariadb client
[10:26:42] I am going to try to make it work
[10:27:00] :)
[10:28:41] mysql --disable-auto-rehash --ssl-mode=DISABLED --socket=/run/mysqld/mysqld.sock
[10:28:49] it just needs tweaking
[10:29:00] and that it is using sock_auth is not immediately apparent from the grants
[10:32:10] yeah, our grant commands don't work on mysql 8
[10:34:16] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) An update on this. We have pretty much completed all the tasks we had scheduled for the failover. We are now advancing on other tasks, to complete them faster...
[10:40:20] there is a lot to change with mysql 8: grant and account handling is different
[10:40:28] and the user table is different too
[10:44:14] yeah, I was wondering about the user table indeed
[10:44:23] I haven't checked it but I was wondering if it was too different
[10:54:35] there are 2 bugs in mysql.py
[10:54:46] one is that we don't correctly handle .wikimedia hosts
[10:55:18] the other is that I just realized I don't add /root/.my.cnf as the default place to look for configuration
[10:55:25] so either it requires sudo -i
[10:55:40] or a change to the mysql command line arguments
[10:56:44] preferably both I would say
[12:32:56] banyek: what happened with the options you set on dbstore2002 to ease replication? did you revert them?
[12:57:02] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Campsite, 10User-Ladsgroup: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui)
[13:51:23] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) In T204593 it was mentioned that I re-enabled the cache on the host even though the BBU is broken - I disabled it because of the host's SPOF-ness
[14:31:20] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) A support ticket has been submitted with HPE Case ID: 5332806955
[14:31:46] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) Thanks!! :-)
[15:11:13] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Banyek) I will reclone this database instance
[15:12:06] 10DBA, 10MediaWiki-extensions-Translate, 10Operations, 10Performance-Team, and 2 others: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Gilles)
[15:41:42] Ok folks I guess I'll leave for today
[15:42:57] marostegui: is it possible for me to look at what is in the parsercache servers?
[15:43:31] I guess that's the only way for me to judge the impact on them - to see what is in them :)
[15:43:34] addshore: only the mediawiki wise men can interpret that format
[15:43:51] * addshore isn't even sure if he has access there
[15:43:54] as in, probably, but you are asking in the wrong channel, for us that is a black box
[15:44:14] okay!
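Picking up the db1118 / MySQL 8.0 thread from earlier this morning: a minimal sketch, assuming local access on the host and the socket path quoted above, of how the root accounts there could be inspected. This is not necessarily what was actually run; it only illustrates why the existing grant tooling needs adapting for 8.0, where the authentication part of an account lives in CREATE USER rather than in GRANT:
```
# Minimal sketch, assuming local access on db1118 and the socket path quoted above.
# List the root accounts and their auth plugins (mysql.user is still selectable on 8.0).
mysql --disable-auto-rehash --ssl-mode=DISABLED --socket=/run/mysqld/mysqld.sock \
      -e "SELECT user, host, plugin FROM mysql.user WHERE user = 'root';"

# pt-show-grants works once the client options are right; on 8.0 the authentication part
# is reported via SHOW CREATE USER instead of being embedded in the GRANT statements.
pt-show-grants --socket=/run/mysqld/mysqld.sock | grep root
mysql --ssl-mode=DISABLED --socket=/run/mysqld/mysqld.sock \
      -e "SHOW CREATE USER 'root'@'localhost'\G"
```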
[15:44:29] also, if you want to see the state, mostly you want to check the memcache + parsercache combination
[15:44:36] you should be able to sql dbname -h db12345
[15:44:43] or whatever the new syntax is
[15:45:08] Reedy: he said look at what is in there and read it, not just SELECT it, that may need some code
[15:45:47] only the arcane ones know how to read parsercache keys!
[15:46:27] or something that calls some obscure mediawiki internal function :-)
[15:47:04] Reedy: sorry, you were answering the access part, yes, that has the same credentials as the other dbs
[15:47:18] :)
[15:47:28] although it is configured by ip, so not 100% sure it will work
[15:47:56] it definitely did at one point in the past
[15:48:07] But I haven't tried anytime recently
[15:48:36] you need to light 3 candles to the mediawiki devil and sacrifice a goat for it to work now!
[15:48:54] The Germans have a lot of goats
[15:49:15] the Germans have a lot of Qs in identifiers!
[15:58:38] Reedy: hmm, struggling to access..
[15:59:08] see, how many goats, only 1, 2?
[15:59:13] aaah, --host not -h
[15:59:20] lol
[16:00:41] perhaps I need to specify a "cluster" or "group"
[16:00:51] I guess perhaps a cluster, not a group
[16:00:58] pc1/pc2/pc3?
[16:01:12] it is a special set of hosts because they don't contain data
[16:01:25] you should be able to get the pass and query the ips directly
[16:01:33] * addshore was trying pc2004 or the ips related to that
[16:01:43] please be careful, they are in read-write!
[16:01:50] unlike most metadata hosts
[16:01:53] * addshore will be :)
[16:03:17] also, don't say I didn't warn you that you may be disappointed about what you will find there
[16:05:32] Reedy: I think your tips are misleading, you made it seem easy
[16:06:36] did you manage it?
[16:06:55] as far as I can tell you can't use sql.php
[16:07:02] :-(
[16:07:03] * addshore will try with mysql now
[16:07:11] you should be able to do it manually
[16:07:55] you may think I am not telling you how, but I genuinely don't know, I never touch mediawiki or the mediawiki accounts at all
[16:08:17] we manage the dbs out of band with a separate set of accounts
[16:08:55] ahh, sql.php actually allows you to pass parameters through to the mysql command, and then still do its magic for fetching dbnames!
[16:09:28] * addshore might have to tap out and try this again later
[16:10:23] ERROR 1045 (28000): Access denied for user 'wikiuser'@'10.64.32.16' (using password: YES)
[16:10:34] different user?
[16:10:36] to be fair, not accessing the pc hosts may be a feature
[16:10:38] anyway, I'm off for now
[16:10:49] intended by the sql.php author
[16:10:58] * addshore still needs to try to purge the parser cache up to the current time for wikidata tonight :)
[16:11:54] Reedy: if you find out how feel free to send it my way! :)
[16:12:12] Reedy: you may be mixing passwords
[16:12:17] that is certainly possible
[16:13:39] I am
[16:13:50] mysql:wikiadmin@pc1004 [(none)]>
[16:13:54] Yeah, the right password helps :P
[16:16:42] Is wgDBsqlpassword even still used?
[16:17:06] https://www.mediawiki.org/wiki/Manual:$wgDBsqlpassword
[16:17:12] This feature was removed from MediaWiki core in version 1.5.0.
[16:17:16] Why do we still have it set...
[16:20:39] Ooo, you got in though?
[16:40:42] Reedy: ^^
[17:05:09] yeah
[17:23:40] With what command?
:)
[17:31:30] mysql -u wikiadmin -h pc1004 -p
[18:28:13] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10Wikimedia-production-error: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10JTannerWMF)
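For anyone retracing Reedy's steps above, a minimal sketch of peeking at parser cache rows once connected to pc1004. The table layout is an assumption (MediaWiki's SqlBagOStuff sharding, i.e. a parsercache database with pcNNN tables holding keyname/value/exptime), and the key prefix is only a guess at the key format:
```
# Minimal sketch, assuming the SqlBagOStuff layout (parsercache.pcNNN tables with
# keyname/value/exptime columns); table name and key prefix below are assumptions.
mysql -u wikiadmin -h pc1004 -p parsercache -e "
  SELECT keyname, exptime, LENGTH(value) AS bytes
    FROM pc000
   WHERE keyname LIKE 'wikidatawiki:pcache:%'
   LIMIT 10;"
# The value column holds a serialized (and typically compressed) PHP object, which is why
# reading the actual entries still needs MediaWiki code, as warned above - a plain SELECT
# only shows opaque blobs.
```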