[04:01:40] DBA, MediaWiki-extensions-WikibaseClient, Operations, Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3496077 (Krinkle) [05:13:41] DBA, Data-Services, Security-Team, WMF-Legal, and 5 others: Make wbqc_constraints table available on Quarry et al. - https://phabricator.wikimedia.org/T170927#3496145 (Marostegui) labsdb1011 will have puppet disabled till Monday most likely due to maintenance, but if you let me know the command yo... [05:43:33] DBA, cloud-services-team: Compress InnoDB on db1102 - https://phabricator.wikimedia.org/T172169#3496169 (Marostegui) [06:19:17] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3496180 (Marostegui) [08:14:35] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496301 (Marostegui) I have compared the list and I believe it can be dropped from: ``` bawiki de_labswikimedi... [08:15:40] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496302 (Marostegui) p:Triage>Normal [08:17:08] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3496306 (Marostegui) [09:42:01] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496443 (MarcoAurelio) Humm... eswikibooks never had flaggedrevs installed afaik. There was a discussion about e... [09:42:27] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496444 (MarcoAurelio) Same for eswiki. [09:50:26] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496463 (Marostegui) Thanks @MarcoAurelio, so from your side the list of wikis that can get the table delete: T1... [09:56:00] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3496482 (MarcoAurelio) @Marostegui I'd have to check other wikis as well but as for meta, eswiki and eswikibooks...
[12:17:48] DBA, Wikimedia-Site-requests, Tracking: Database table cleanup (tracking) - https://phabricator.wikimedia.org/T18660#3496810 (KartikMistry) [12:20:01] DBA, ContentTranslation, Language-2017 Sprint 10, WorkType-Maintenance: Remove cx_drafts table from production - https://phabricator.wikimedia.org/T172364#3496811 (KartikMistry) [12:20:35] DBA, ContentTranslation, Language-2017 Sprint 10, WorkType-Maintenance: Remove cx_drafts table from production - https://phabricator.wikimedia.org/T172364#3496813 (Marostegui) Thanks for the ticket - we will take it from here [12:43:16] volans: we can put https://gerrit.wikimedia.org/r/#/c/280947/1/dbhell/db-config-json.php on mediawiki-config (aka noc), and get a JSON string [12:43:54] then we can read it from an icinga check or the databases [12:44:24] also marostegui^, I only commented it because it was volans' idea [12:54:28] jynus: could be an idea; why is the primary DC based on argv? [12:58:18] volans: that is because it was intended for something else [12:58:20] however [12:58:29] :) [12:58:29] we are reinventing the wheel [12:58:36] https://noc.wikimedia.org/db.php?format=json [12:58:38] :-) [12:58:46] someone had the idea before [12:58:54] sadly, it needs a few fixes [12:59:27] jynus: going to disable gtid on the slaves now (I have done all the tasks before) [12:59:29] that I introduced- checking the mw_primary config, and explicitly setting the masters (JSON has no concept of ordering) [12:59:33] marostegui: ok [12:59:44] yeah [12:59:47] but that [12:59:59] also, why are hosts and loads split? [13:00:00] could be a good basis for a monitoring tool [13:00:30] volans: don't ask me about that :-) [13:00:35] ask whoever did it [13:00:46] I know, I was trolling :-P [13:00:49] I will send a patch to complete it and make it saner [13:01:16] and once deployed, I will be able to use it for a proper check [13:01:23] maybe even for heartbeat [13:02:13] we may have to make sure noc is also not a SPOF [13:02:23] not sure if it is active-active [13:02:46] but this looks good as a short-term thing for many needs [13:03:14] I think it is a/p right now but there was some plan to make it a/a, not sure about the current status [13:03:23] I will get involved [13:03:41] sounds like a plan! [13:03:44] sorry, marostegui, it is not that I am not listening to you- but I finally found the light at the end of the tunnel [13:03:55] hahaa no worries :) [13:04:04] once the big steps come along I will make sure you review them :)
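A minimal sketch of the kind of consumer discussed above, assuming the endpoint's JSON mirrors the db-eqiad.php $wgLBFactoryConf layout with a top-level 'sectionLoads' map of section -> {host: load}; the key names are an assumption, not confirmed by the endpoint:

```python
# Minimal sketch: read the db config noc exposes as JSON.
# Assumption: the payload mirrors $wgLBFactoryConf, with a
# 'sectionLoads' map of section -> {host: load}; key names may differ.
import json
from urllib.request import urlopen

with urlopen('https://noc.wikimedia.org/db.php?format=json') as resp:
    conf = json.load(resp)

# As noted in the conversation, JSON objects are unordered, so "the
# first host is the master" cannot be relied on here; the masters
# would need to be listed explicitly before a check could use this.
for section, loads in conf.get('sectionLoads', {}).items():
    print(section, sorted(loads))
```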
[13:12:24] jynus: going to start moving slaves carefully under the future new master [13:12:30] sure [13:12:46] long when you start, once is ok :-) [13:12:48] *log [13:13:12] yep [13:13:29] I will be checking https://tendril.wikimedia.org/tree [13:13:34] thanks :) [13:13:41] maybe as we are both here [13:20:06] jynus: topology changes done [13:20:20] when you have a sec, review the patches and we can start the fun part [13:20:44] 2002 doesn't have s4, right? [13:20:57] nope [13:21:24] checking [13:21:26] thanks! [13:23:27] db2051 is getting delayed (the new master) :| [13:23:47] ? [13:23:55] recovered [13:24:00] probably something one-off [13:24:34] only spikes, right? [13:24:36] yeah [13:24:39] I set up db2016 [13:24:47] with no binlog sync [13:24:51] that may help later [13:24:54] ah yeah [13:25:08] I hoped it wasn't necessary [13:25:14] but I didn't trust the disks [13:25:16] so, how do you want to do heartbeat today? kill it on the old master, merge puppet, and run puppet on both? [13:25:27] yeah [13:25:29] that is ok [13:25:32] I think [13:25:33] or merge puppet and run puppet at the same time on both and let it take care of it? [13:25:44] I prefer the first option :) [13:25:50] I think it doesn't matter [13:26:08] I prefer no heartbeat for a sec over 2 heartbeats running for a sec :) [13:26:16] but I would like the first more if it were an active scenario [13:26:23] jynus: I figured out a plan for the coverage gap, marostegui has to move to Australia ;) [13:26:29] oh [13:26:31] ha [13:26:41] the coverage was more for icinga, but that works, too [13:27:15] chasemp: the plan is to add more paging, but only after removing others, as a prioritization thing [13:27:16] I hear New Guinea is very nice [13:27:23] and they speak Spanish! [13:27:27] heh [13:27:30] do they? [13:27:44] jynus: really, that makes sense, I'm just being not-funny :) [13:27:47] When I was there LOTS of people spoke Spanish [13:28:08] marostegui: note the buffer pool on db2051 is not yet full [13:28:17] so that may partially explain the lag [13:28:23] jynus: yeah, that could be it, good one [13:28:25] did you in any case [13:28:32] downtime the lag checks? [13:28:36] yep [13:28:43] in case something goes wrong [13:28:46] great [13:28:48] yep, everything downtimed in s4 codfw [13:29:00] ok, then go for it [13:29:12] the more complex part lately is the topology changes [13:29:14] killing heartbeat [13:29:38] after we decided to stop using the bad-behaving GTID, it will probably be a smoother change [13:29:45] deploying puppet [13:30:12] running puppet on the hosts [13:30:17] lag is a bit high now [13:31:06] db2051 got heartbeat enabled [13:31:21] lag recovered? [13:31:23] getting better again [13:31:25] yes [13:31:31] 1-minute lag 1 minute ago [13:31:33] ok, going to deploy mediawiki config? [13:32:11] ok [13:32:19] bytes sent increased a lot [13:32:25] probably just the replication [13:33:24] there are some interesting traffic spikes on s4 [13:33:36] but almost surely due to commons, not the maintenance [13:33:47] (those crazy metadata issues) [13:33:59] which graph are you looking at now? [13:35:15] tendril [13:35:27] ah ok :) [13:35:29] I assume grafana sent bytes would report the same [13:35:43] MW config is deployed [13:36:17] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2051&var-port=9104&from=1501756568685&to=1501767368685 [13:36:23] so that is really it [13:36:42] It was worse when replication broke on my first CHANGE MASTER [13:36:44] :-) [13:36:47] haha [13:36:56] well, you were brave with gtid [13:38:17] I think the only thing that I may do differently is that [13:38:36] I would depool the master before pooling it [13:38:48] to avoid long-running queries [13:38:52] yeah [13:38:55] I wonder if the current master [13:38:55] That is a good tip [13:39:05] if it was active, I mean [13:39:10] not in this case [13:39:18] yeah, but in eqiad that is a MUST I think [13:39:22] I wonder if the current master may have table partitioning? [13:39:25] :-D [13:39:26] (I didn't think about it) [13:39:30] I didn't either [13:39:35] but just thought about it [13:40:05] I will put it in the notes
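Since the switchover leans on heartbeat and downtimed lag checks, a rough sketch of a pt-heartbeat-style lag probe follows, assuming the conventional heartbeat.heartbeat table with ts stored as an ISO-8601 string, credentials in /root/.my.cnf, and clocks in sync; illustrative only, not the production check:

```python
# Rough sketch of a pt-heartbeat-based lag probe. Assumptions: the
# conventional heartbeat.heartbeat table exists, ts is an ISO-8601
# string written by the master in UTC, and /root/.my.cnf has creds.
from datetime import datetime

import pymysql

conn = pymysql.connect(read_default_file='/root/.my.cnf', db='heartbeat')
try:
    with conn.cursor() as cur:
        # Newest replicated master timestamp; with synced clocks,
        # replica lag is approximately "now" minus this value.
        cur.execute('SELECT MAX(ts) FROM heartbeat')
        (ts,) = cur.fetchone()
finally:
    conn.close()

lag = datetime.utcnow() - datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%f')
print('approximate lag: %.3fs' % lag.total_seconds())
```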
[13:40:25] jynus: re:cumin with resources, this is a valid query to an existing define with parameters: 'R:cassandra::instance::monitoring and R:cassandra::instance::monitoring%tls_cluster_name = services' [13:40:26] Let's move the OLD master under db2051 then? as everything looks stable [13:40:37] it doesn't [13:40:39] it is api [13:40:43] and support for query aliases will be introduced very shortly (already merged in master) [13:40:44] not vslow [13:40:46] rcs [13:40:57] marostegui: +1 [13:41:23] volans: we can try mariadb::instance [13:41:24] \o/ [13:41:40] and that should give us all multi-instance hosts? [13:42:12] db1102.eqiad.wmnet,dbstore2002.codfw.wmnet [13:42:35] volans: yep [13:42:37] that is perfect [13:42:44] so that may be enough [13:43:02] volans: in the future we may want to do a backend [13:43:10] an SQL backend [13:43:27] I think you may have proposed that in the past [13:43:40] you mean transport I guess [13:43:46] yes, sorry, that [13:44:22] so we work with mariadb instances and connect using SQL rather than puppet + nodes [13:44:54] for now, that + running mysql locally would be enough [13:45:06] yeah, it makes sense to look into it [13:45:15] not a priority at all [13:45:28] but if we abandon puppet as a source of truth for that [13:45:42] it makes no sense to create a parallel tool [13:46:34] indeed [13:46:46] I do not want to make it a goal of the tool [13:46:50] but I do not want to ignore it either [13:47:16] as so much work has already gone into it regarding parallelization and failure handling and all that [13:48:11] what is clear is that we can delete the salt grains [13:48:25] that for sure! [13:48:26] I am going to do that right now [13:48:36] marostegui: how is replication doing? [13:48:51] I enabled gtid on db2037 and so far so good [13:48:55] there are spikes, but the usual ones [13:49:01] jynus: so you plan to add shard/instance parameters to the mariadb::instance define? [13:49:05] I would give it a day [13:49:27] To enable gtid you mean? [13:49:27] volans: technically, the name of the instance is the shard [13:49:38] sorry, too hot, I already forgot :( [13:49:51] marostegui: sorry, 2 convos here [13:49:54] XD [13:49:57] marostegui: enable gtid now [13:50:08] jynus: yeah, I was doing so :) [13:50:10] marostegui: I will give replication 1 day to evaluate [13:50:21] We are making jynus hyper-thread, volans [13:50:34] lol [13:50:37] marostegui: that is ok- you both do the work for me [13:50:41] jynus: yeah, hopefully it will be better than db2019 [13:50:44] hahaha [13:50:50] and I only tell you what to do [13:50:55] I am practically a manager now [13:51:02] https://vignette1.wikia.nocookie.net/horadeaventura/images/2/20/Nofor_te_da_l%C3%A1tigo.gif/revision/latest?cb=20141224010727&path-prefix=es [13:51:10] to query a specific "name" you can use 'R:mariadb::instance = s1' and right now you get: dbstore2002.codfw.wmnet [13:52:09] https://twitter.com/DBAReactions/status/881240836918448129 [13:52:27] volans: let me try [13:52:32] hahahahaha [13:53:04] lol, so without a = value you get all instances that have the define at least once; with a = value the name has to match the value too [13:53:33] so it works perfectly, doesn't it? [13:54:03] yes, it should cover the use cases we need
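For completeness, the same query driven from Cumin's Python API instead of the CLI; the module and class names below follow Cumin's documented example as best recalled, so treat every name as an assumption that may not match the deployed version:

```python
# Hedged sketch of Cumin's Python API (names per Cumin's docs; treat
# them as assumptions that may differ from the deployed version).
import cumin
from cumin import query, transport, transports

config = cumin.Config()  # defaults to /etc/cumin/config.yaml

# Hosts declaring the mariadb::instance define with title 's1';
# without '= s1' it matches any host with at least one instance.
hosts = query.Query(config).execute("R:mariadb::instance = s1")

target = transports.Target(hosts)
worker = transport.Transport.new(config, target)
worker.commands = [transports.Command('systemctl status mariadb@s1')]
worker.handler = 'sync'  # run the command in lockstep on all hosts
worker.execute()
```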
[13:54:10] I mean, I do not know if we will use instance just for multiinstance [13:54:12] or for all [13:54:29] but if we don't, we will create a role::mariadb::common or something [13:54:38] (resource) [13:54:55] that we'll see; no problem, we can mix stuff, and with aliases, if a query is complex, we could save it as A:mariadb-s1 for example [13:55:58] I am doubtful because a multiinstance is slightly different from a dedicated server [13:56:17] for example, a dedicated server you start with systemctl start mariadb [13:56:18] DBA, Patch-For-Review: db2019 has performance issues, replace disk or switchover s4 master elsewhere - https://phabricator.wikimedia.org/T170351#3497035 (Marostegui) p:Triage>Normal a:Marostegui db2019 has been failed over to db2051. db2051 is now the master We will see if the replication impro... [13:56:42] while an instance uses systemctl start mariadb@s1, for example [13:57:24] on the other hand, there is duplicated code [13:57:27] could it make sense to have it everywhere? [13:57:32] yes [13:57:40] it could be some sort of molly too ;) [13:57:52] so instance with 'default' title [13:58:00] is considered a dedicated server [13:58:29] if you do mariadb@s1 on an s2 host by mistake :-P [13:58:37] like our molly-guard for reboots [13:58:43] volans: fixed already [13:58:55] can you see the "yes, we thought about that" XD [13:59:04] :D [13:59:22] if you try to start mariadb@s1 on the wrong server, it will fail due to the lack of configuration added by puppet [13:59:48] if you try to start mariadb on a multi-instance shard, it kills itself as a preexec and logs that you should use the @ syntax [14:00:10] first one ConditionPre, second StartPre or something [14:00:13] exactly, so using the multiinstance code everywhere adds some safety checks for free [14:00:19] yes [14:00:49] I thought about it, but of course I started testing without touching existing code [14:00:58] now I will probably do it [14:01:00] the only thing that will be complex would be to do a systemctl status on all hosts, via cumin I mean [14:01:23] yeah, we lack a way to get the socket [14:01:42] and a way to query what is actually there [14:02:06] I was also hoping not to implement multi-instance on init.d [14:02:20] for older versions [14:02:37] I also have other questions- should the role open ports [14:02:48] hopefully not! I'm wondering if we should expose the available shards/instances somewhere in the hosts for easy consumption [14:02:53] or should the individual instances do it [14:03:11] "hopefully not" was for the init.d ofc ;) [14:03:15] should we have in hiera a list of shards to use as a quick way to add shards? [14:03:29] like: db1053: s1,s3,s4 [14:03:41] and then resource[hierakeys] [14:03:55] I do not like using hiera for that [14:04:06] but we may have exactly the same hosts [14:04:16] but one with 7 shards and the other with 4 [14:04:42] it's all temporary though, so not "that" bad [14:04:44] this is the summary [14:04:52] of why I am now breaking things [14:05:01] building the bricks [14:05:14] eheheh [14:05:15] but hopefully later having good abstractions [14:05:24] are the instance ports the same everywhere? [14:05:32] yes [14:05:44] good, should help at some point [14:06:03] there is a mapping s1=3311, s2=3312, x1=3320, m1=3321 [14:06:06] s/should help/could be helpful/ [14:06:30] and avoiding the default port [14:06:55] then prometheus was more complicated [14:06:59] it uses port 9104 [14:07:09] but now we have to host one instance per mysql instance :-) [14:07:23] so we use 10000 + mysql port [14:07:31] all so much fun [14:07:53] lol [14:08:02] that is basically why I ask for patience when I am "breaking" puppet [14:08:14] I think eventually it will get better
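The port convention just described fits in a few lines; the mapping below is partial, containing only the shards mentioned above:

```python
# Shard -> mysqld port convention from the conversation (partial:
# only the examples given), plus the rule that each instance's
# prometheus mysqld-exporter listens on 10000 + the mysql port.
SHARD_PORTS = {'s1': 3311, 's2': 3312, 'x1': 3320, 'm1': 3321}

def prometheus_port(shard):
    """Exporter port for a shard's mysqld instance."""
    return 10000 + SHARD_PORTS[shard]

assert prometheus_port('s1') == 13311
```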
[14:36:20] DBA, Patch-For-Review: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321#3497282 (Marostegui) I am planning to move either s4 or s5 to dbstore2002 Current state of dbstore2002 ``` root@dbstore2002:~# df -hT /srv/ Filesystem Type Size Used Avail Use% Mount... [14:41:31] DBA, Cloud-Services: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#2931704 (Reedy) Wiki is created now! [14:46:34] DBA, Cloud-Services: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041#3497370 (jcrespo) Thanks: we will run our sanitization and private data checks, add the extra grants needed, and when done, assign it to the cloud team. [15:33:06] DBA, Wikimedia-Site-requests: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3497566 (Reedy) Open>Resolved Wiki has been created. Initial user needs creating via createAndPromote, but @Dereckson is more than capable of doing that [15:45:56] DBA, Analytics, Analytics-EventLogging, Contributors-Analysis, and 2 others: Add index to mediawiki_page_create_1 table - https://phabricator.wikimedia.org/T170990#3497629 (elukey) [15:48:41] DBA, Analytics, Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3497641 (Halfak) [15:49:22] DBA, Cloud-Services: Prepare and check storage layer for hi.wikiversity - https://phabricator.wikimedia.org/T171829#3477438 (Reedy) Wiki has also been created today [15:52:25] DBA, Analytics: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3497685 (elukey) [15:52:37] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Patch-For-Review, Schema-change: flaggedrevs.fr_user is unindexed - https://phabricator.wikimedia.org/T172207#3497686 (demon) I can confirm it's unused on mediawikiwiki. Stupid experiment: I pushed for installation and I p... [19:45:47] DBA, Analytics, Contributors-Analysis, Chinese-Sites, Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3498662 (Milimetric) Hm, @Marostegui I need some help. I ran the sqoop job to import from all the wikis except the ones you mentioned... [20:21:55] DBA, Analytics, Contributors-Analysis, Chinese-Sites, Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3498832 (Marostegui) @Milimetric can you try jawiki_p again, for instance? [23:22:19] DBA, MediaWiki-extensions-WikibaseClient, Operations, Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3499337 (Krinkle)