[05:11:22] 10DBA, 10AbuseFilter, 10PostgreSQL: Joins on INTEGER and TEXT fail with PostgreSQL - https://phabricator.wikimedia.org/T42757 (10Marostegui) >>! In T42757#4572252, @Daimona wrote: > @Marostegui Unfortunately, neither do I :-) However, PG is just the cause here. What we want to do (and I'm asking DBAs about)...
[07:25:20] I am going to disable puppet on the DBs and merge the alert for multi-instance, and test on codfw first
[07:29:04] ok
[07:30:23] when you are finished, I can disable gtid on the codfw masters
[07:30:35] coool, thanks :)
[07:30:37] I will let you know
[07:57:14] jynus: let me know if you want to do that additional test of the switchdc "sync check" creating some lag in one replica
[07:57:38] volans: there is currently lag on db2084:3315 (not sure if that is enough)
[07:57:54] volans: yes, I would like to see the lag check failing
[07:58:14] for that, we need to stop replication on a codfw master
[07:58:18] ok let me run it right now and see :)
[07:58:19] e.g. x1
[07:58:29] volans: it won't work now
[07:58:48] we need to wait for marostegui to finish his maintenance
[07:58:58] but he said there is already lag
[07:59:04] on a slave
[07:59:05] only on a replica
[07:59:16] ahhh sorry, assumed a master :)
[07:59:24] I am only using s5, btw, so you can test with any other section
[08:02:10] volans: one second, it will take me some time to set things up
[08:02:38] sure, no hurry
[08:06:22] 10DBA, 10AbuseFilter, 10PostgreSQL: Joins on INTEGER and TEXT fail with PostgreSQL - https://phabricator.wikimedia.org/T42757 (10Daimona) @Marostegui Perfect. I guess we should seize the opportunity and rename afl_filter to afl_filter_id to make sure that it's INT. At any rate, I think this should be momenta...
[08:09:15] volans: I am ready
[08:09:39] what I am going to ask you is to run the "check replicas caught up" one
[08:09:57] but the x1 master will not catch up because I will stop it
[08:10:19] tell me when we can run that, as I do not want to leave x1 lagging for a long time
[08:10:36] and wait for me to stop replication
[08:11:16] jynus: which direction?
[08:11:29] check that codfw is in sync with eqiad, or the opposite?
[08:11:43] so the same we are going to do tomorrow
[08:11:52] eqiad -> codfw
[08:11:55] ack
[08:11:55] I will stop codfw
[08:12:10] ack, I'm ready
[08:12:14] I guess we can also test the inverted one
[08:12:30] ok, will log the stop, as it may create mediawiki monitoring issues
[08:12:38] ack
[08:13:32] volans: done, you should run it now
[08:13:36] ack
[08:13:38] and it should fail or get stuck
[08:13:48] if it says correct, we have a problem
[08:14:14] Failed to call 'spicerack.mysql._check_core_master_in_sync' [1/3, retrying in 3.00s]: Heartbeat from master db2034.codfw.wmnet for section x1 not yet in sync: 2018-09-11 08:13:17.001080 < 2018-09-11 08:13:49.000970 (delta=-31.99989)
[08:14:22] yep failed
[08:14:30] spicerack.mysql.MysqlError: Heartbeat from master db2034.codfw.wmnet for section x1 not yet in sync: 2018-09-11 08:13:17.001080 < 2018-09-11 08:13:49.000970 (delta=-31.99989)
[08:14:41] does it try a few times?
[08:14:48] or until canceled?
[08:14:49] yes
[08:14:53] [1/3, retrying in 3.00s]
[08:14:56] ok
[08:14:57] [2/3, retrying in 9.00s]
[08:15:02] and fails at the 3rd attempt
[08:15:11] cool
[08:15:14] do you want to wait?
[08:15:32] wait what?
[08:15:43] has it failed already?
[08:15:47] yes
[08:15:48] ok
[08:15:53] Tue 10:14:22 volans| yep failed
[08:15:55] it wasn't clear :-)
[08:16:00] sorry, you can re-enable the replica
[08:16:01] sorry
[08:16:06] doing
[08:17:41] fwiw it also failed once on es3 and succeeded at the next retry
[08:17:47] it was the same heartbeat
[08:17:58] Heartbeat from master es2017.codfw.wmnet for section es3 not yet in sync: 2018-09-11 08:13:52.000950 < 2018-09-11 08:13:52.000950 (delta=0.0)
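For reference, a minimal sketch of the kind of heartbeat comparison the check above performs. This is not the actual spicerack.mysql code: the table layout follows stock pt-heartbeat, and the hostnames, credentials and retry policy (3 tries with 3s/9s backoff, as seen in the log) are assumptions reconstructed from the output.

```
# Hedged sketch: compare the pt-heartbeat timestamp replicated to the
# target-DC master against the source side, retrying with backoff.
# Not the real spicerack implementation; credentials are placeholders.
import time
from datetime import datetime

import pymysql


def latest_heartbeat(host, master_server_id):
    """Fetch the newest heartbeat row written by the given master."""
    conn = pymysql.connect(host=host, user="check", password="...", db="heartbeat")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT ts FROM heartbeat WHERE server_id = %s",
                        (master_server_id,))
            (ts,) = cur.fetchone()
    finally:
        conn.close()
    # pt-heartbeat stores ts as e.g. '2018-09-11T08:13:17.001080'
    return datetime.strptime(ts[:26], "%Y-%m-%dT%H:%M:%S.%f")


def check_master_in_sync(source, target, server_id, tries=3):
    """Raise unless the target has replicated the source's heartbeat."""
    for attempt in range(1, tries + 1):
        source_ts = latest_heartbeat(source, server_id)
        target_ts = latest_heartbeat(target, server_id)
        delta = (target_ts - source_ts).total_seconds()
        if delta >= 0:
            return  # target caught up
        if attempt == tries:
            raise RuntimeError(
                "Heartbeat from master %s not yet in sync: %s < %s (delta=%s)"
                % (target, target_ts, source_ts, delta))
        time.sleep(3 ** attempt)  # 3.00s, then 9.00s, as in the log above
```

With replication stopped on the x1 codfw master, the replicated heartbeat stops advancing, the delta goes negative, and the check fails after the third attempt, which is exactly the behaviour logged at 08:14:14.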
[09:54:01] I am going to test the alerting again, db1096 (hopefully) will page
[10:28:57] I am done with the paging tests, I am going to re-enable puppet on the DBs
[10:33:15] 10DBA, 10Patch-For-Review: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 (10Marostegui) Paging for replication lag / broken has been tested nicely for active/non active replicas codfw: passive DC db2084:3315 only alerted on IRC db2075 only alerted on IRC eqiad: active DC db...
[10:33:24] 10DBA, 10Patch-For-Review: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 (10Marostegui)
[10:33:53] jynus: do you want me to take over the gtid check + enablement on codfw masters?
[10:34:09] that would be great
[10:34:14] I am doing a warm-up
[10:34:16] ok! taking over then
[10:34:37] s/enablement/disablement
[10:34:38] :)
[10:34:40] no need to check on every step for that
[10:34:48] just check the list afterwards or something
[10:34:57] what?
[10:35:18] no need to ask me to verify every step like when we set up replication
[10:35:34] ah, no, I just wanted to make sure that wouldn't mess with your stuff
[10:43:12] 10DBA, 10Operations, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) The following masters had GTID enabled - I have disabled it: ``` db2048 Using_Gtid: Slave_Pos db2035 Using_Gtid: Slave_P...
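Disabling GTID on a MariaDB replica (here, the codfw masters, which replicate from eqiad) boils down to stopping replication, switching the replication mode, and starting it again. A hedged sketch of that step: the statements are standard MariaDB syntax, but the host list is just the two hosts visible in the truncated paste above, and the credentials are placeholders.

```
# Hedged sketch: turn off GTID-based replication on each codfw master.
# STOP SLAVE / CHANGE MASTER TO MASTER_USE_GTID / START SLAVE are real
# MariaDB statements; hosts and credentials here are placeholders.
import pymysql

CODFW_MASTERS = ["db2048.codfw.wmnet", "db2035.codfw.wmnet"]  # partial list


def disable_gtid(host):
    conn = pymysql.connect(host=host, user="admin", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None or status["Using_Gtid"] == "No":
                return  # not a replica, or GTID already disabled
            cur.execute("STOP SLAVE")  # replication must be stopped first
            cur.execute("CHANGE MASTER TO MASTER_USE_GTID=no")
            cur.execute("START SLAVE")
    finally:
        conn.close()


for master in CODFW_MASTERS:
    disable_gtid(master)
```

Verifying afterwards is a matter of re-running SHOW SLAVE STATUS and confirming that Using_Gtid reports No rather than Slave_Pos, which is what the T189107 comment above records.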
[13:11:48] 10DBA, 10AbuseFilter, 10PostgreSQL: Joins on INTEGER and TEXT fail with PostgreSQL - https://phabricator.wikimedia.org/T42757 (10Marostegui) >>! In T42757#4573297, @Daimona wrote: > @Marostegui Perfect. I guess we should seize the opportunity and rename afl_filter to afl_filter_id to make sure that it's INT....
[13:14:31] marostegui: this would work but has 2 issues https://gerrit.wikimedia.org/r/450228
[13:14:40] 1- what happens when all hosts are in read only
[13:14:50] for maintenance or dc failover
[13:14:56] let me see
[13:15:10] Can't we change the interval of checks?
[13:15:12] 2- we need a method for misc hosts
[13:15:38] marostegui: probably, but I am not sure that is reliable enough
[13:15:54] are you suggesting setting up a long period, like 5 or more minutes?
[13:16:02] The thing is that a DC failover is done once a year (at best) and that will not page, I guess we can assume the alert is fine?
[13:16:06] in soft state
[13:16:10] jynus: Yeah, exactly
[13:16:20] well, it is non-paging
[13:16:24] anyway
[13:16:30] at least right now
[13:16:48] Either a longer SOFT state or just assume that we will have that non-paging alert during a failover
[13:17:08] which is not done often enough to be a noise problem, I would think
[13:17:50] the problem is puppet has 30 minutes of lag
[13:18:01] what do you mean?
[13:18:02] so it should be 30 minutes + maximum window
[13:18:18] the check will take up to 30 minutes to change
[13:18:18] ah
[13:18:21] yeah
[13:18:22] after etcd
[13:18:24] change
[13:18:42] we can force a puppet run on the affected hosts + icinga master if needed in phase 8
[13:18:42] 20 or 30 minutes, cannot remember
[13:18:50] volans: this is for later
[13:18:57] we want the best approach here
[13:19:07] I don't think this will be deployed for tomorrow
[13:19:19] To be honest, I would leave the DC failover scenario aside, as it is something that rarely happens
[13:19:37] ok, but think of other options: master failover
[13:19:41] anyway my 2 cents are that a check for RO seems like something that doesn't need to run every minute but is more a safety net
[13:19:43] that should be frequent
[13:19:50] Yeah, but we are normally in read only for less than 5 minutes
[13:19:51] or a master failure
[13:19:54] etc.
[13:20:08] marostegui: yes, but again, in this case the issue is puppet
[13:20:26] icinga + puppet, as usual the 2 pain points :-)
[13:21:32] retaking yesterday's conversation briefly: the solution is to move the source of truth to the tendril replacement + etcd (for mediawiki) and remove it from icinga + puppet
[13:21:38] (I think)
[13:22:00] and we are not that far away from that
[13:22:35] etcd for the application state, and $tendril_replacement for pure database state
[13:23:33] but I think that should be good enough for now, and we can keep it as irc alerts
[13:23:47] so we don't forget to put things as read-write
[13:23:51] as happened in the past
[13:23:53] we are not that far away from that, but we are not there yet, and we need the alert sooner, yeah
[13:24:07] so we don't get the same issue we got with s5 or something a year ago, yeah
[13:24:07] we don't need the alert this week
[13:24:15] I am not saying this week
[13:24:19] marostegui: exactly
[13:24:23] or x1 last year
[13:24:28] yeah
[13:24:46] What I am saying is that even every hour, we'd still be in a better state than a year ago :)
[13:24:51] let's review it after the switchover
[13:25:05] and maybe deploy it
[13:25:06] sounds good
[13:25:15] after all it is already on multisource hosts
[13:25:18] yeah, let's deploy that and the multi-instance processes one
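The read-only alert under discussion compares what MariaDB reports with what puppet thinks the host should be, which is why the puppet lag matters: the expected state only updates on a puppet run after the etcd change. A minimal sketch of such a safety-net check in the style of an Icinga plugin; the flag names and credentials are assumptions, only the exit-code convention (0 = OK, 2 = CRITICAL) is standard.

```
#!/usr/bin/env python3
# Hedged sketch of a read-only safety-net check: compare
# @@global.read_only against the state puppet expects for this host.
# Flag names and credentials are placeholders, not the real check.
import argparse
import sys

import pymysql

parser = argparse.ArgumentParser()
parser.add_argument("--host", default="localhost")
parser.add_argument("--expect-read-only", action="store_true",
                    help="puppet would set this for passive-DC and non-master hosts")
args = parser.parse_args()

conn = pymysql.connect(host=args.host, user="check", password="...")
with conn.cursor() as cur:
    cur.execute("SELECT @@global.read_only")
    (read_only,) = cur.fetchone()

if bool(read_only) == args.expect_read_only:
    print("OK: read_only matches the expected state")
    sys.exit(0)
print("CRITICAL: read_only=%s, expected read_only=%s"
      % (bool(read_only), args.expect_read_only))
sys.exit(2)  # a long SOFT state absorbs short planned read-only windows
```

Run every few minutes with several soft retries, as suggested above, this would keep planned read-only windows (usually under 5 minutes) from alerting while still catching a host left read-only after maintenance.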
[13:25:30] now the question is what about #2?
[13:25:37] misc services?
[13:25:56] why would they need to be treated differently?
[13:26:07] I think the canonical discovery right now is done at the traffic or dns level
[13:26:18] misc services are independent from mw_primary
[13:26:28] e.g. we are going to fail over mw but not most of misc
[13:26:29] ah yes :(
[13:26:56] and even if we had the app/proxy entry point
[13:27:02] we need the canonical db
[13:27:09] in reality, how likely is it that we will have misc on codfw in the near future?
[13:27:15] quite likely
[13:27:24] in a year?
[13:27:24] we were supposed to have it failed over this time
[13:27:30] this year
[13:27:41] I was thinking of just hard-coding eqiad for now
[13:27:43] and the only blocker is the proxy purchase and setup for codfw
[13:27:58] and we are blockers for at least phab and gerrit
[13:28:34] let me search for the ticket
[13:28:37] maybe use zarcillo?
[13:29:09] https://phabricator.wikimedia.org/T164810
[13:29:12] this is for phab
[13:29:26] yeah I know that one
[13:29:39] and this is the tracking ticket https://phabricator.wikimedia.org/T156937
[13:30:19] only cloud, analytics and dumps are not going multi-dc for now
[13:30:42] the rest should be, and should be mostly independent from each other
[13:31:16] we could ask for etcd switches for each service
[13:31:24] yeah, that would be a good approach
[13:31:27] anyway, I just wanted to throw the question out there
[13:31:33] no need for an answer now
[13:31:39] or use zarcillo, but we still need to feed it from a source
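The "etcd switch per service" idea above would give each misc service (phab, gerrit, ...) its own master-DC pointer instead of following mw_primary or a hardcoded eqiad. A hedged illustration using etcd's v2 HTTP keys API; the endpoint and key layout are entirely hypothetical, only the API shape is real.

```
# Hedged sketch: per-service master-DC switches in etcd. The key
# layout (/v2/keys/dbmaster/<service>) and the endpoint are made up
# for illustration; the JSON decoding matches etcd's v2 keys API.
import json
from urllib.request import urlopen

ETCD = "http://etcd.example.wmnet:2379"  # placeholder endpoint


def master_dc_for(service):
    """Return which DC currently holds the master for a misc service."""
    with urlopen("%s/v2/keys/dbmaster/%s" % (ETCD, service)) as resp:
        node = json.load(resp)["node"]
    return node["value"]  # e.g. 'eqiad' or 'codfw'


# A service's DB entry point would then be derived from, say,
# master_dc_for("phabricator") rather than a hardcoded DC.
```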
[13:31:53] 10DBA, 10MediaWiki-extensions-Translate, 10Operations, 10Technical-Debt: Query returned 22222 row(s): query: SELECT * FROM `translate_metadata` on Metawiki - https://phabricator.wikimedia.org/T204026 (10mark)
[14:23:53] 10DBA, 10Data-Services, 10Toolforge, 10Tracking: Certain tools users create multiple long running queries that take all memory and/or CPU from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601 (10jcrespo)
[14:26:47] 10DBA, 10Core-Platform-Team, 10Structured-Data-Commons, 10Wikidata, and 4 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10daniel)
[14:55:59] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10dbarratt)
[14:58:36] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) Hello, We receive multiple schema changes, in order to make things easier for everyone (and to make sure your ticket is processed faster) please foll...
[15:55:12] qq if you guys have time: on analytics1003 we run a mariadb database that holds some analytics stuff etc. (I think you already know the host). The host runs jessie, and due to hw refresh I am testing the puppet config on stretch
[15:56:14] we use ::mariadb::service, with managed => true in profile::analytics::database::meta
[15:56:42] but puppet now fails for require => File["/lib/systemd/system/${vendor}.service"],
[15:56:47] I think that is only for managing multi-instance hosts where we start services with systemctl mariadb@SECTION (like mariadb@s1)
[15:56:57] If you only have one instance, you can probably use false?
[15:58:54] we could, but I thought that having managed => true was also to avoid manual starts when possible (boot, etc.)
[15:59:00] totally fine to have it false
[15:59:20] mariadb won't start when it boots up (or it doesn't in core)
[15:59:36] I am not completely sure what managed is for though, still reading service.pp
[16:00:34] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10aezell) @Marostegui No problem. I'll get it fixed up. Sorry for the confusion.
[16:01:09] I rebooted an1003 yesterday and mariadb came up after booting without me doing anything
[16:01:34] That is strange, we don't let mysql come up automatically to avoid issues
[16:01:57] elukey: you may have a managed service
[16:02:07] do you have a role or profile for me?
[16:02:30] elukey: by default on service.pp managed is false
[16:02:33] our service configuration actually allows for automatic mysql start, it is just not the default and we don't enable it on core
[16:02:54] so profile::analytics::database::meta is where it is all defined, managed => true in there
[16:02:56] you likely either altered it manually or configured it in puppet (and that is ok)
[16:03:03] yep yep
[16:03:08] elukey: so that is normal and expected actually :-D
[16:03:25] it is how non-critical dbs are recommended to be set up
[16:03:52] yeah, but while testing it on stretch I can see a failure during the first puppet run due to
[16:03:55] Error: Failed to apply catalog: Could not find dependency File[/lib/systemd/system/mariadb.service] for Service[mariadb] at /etc/puppet/modules/mariadb/manifests/service.pp:64
[16:03:56] critical as in: you want them fully automated, not taking care of everything
[16:03:57] elukey: I think removing the managed => true should be fine as per service.pp
[16:04:23] elukey: we may not have updated it for stretch
[16:04:38] send a ticket or something and I will make sure to fix it
[16:05:01] as we may not have any server with stretch using the managed option and may have forgotten to make it compatible
[16:05:06] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) >>! In T204006#4574533, @aezell wrote: > @Marostegui No problem. I'll get it fixed up. Sorry for the confusion. No worries at all :-)
[16:05:18] elukey: sorry for the inconvenience
[16:05:31] you can disable it temporarily, as marostegui said
[16:05:42] and re-enable it after I fix it
[16:05:50] does that seem reasonable?
[16:06:13] wait
[16:06:23] sure sure, no rush at all, just wanted to have a more expert opinion :)
[16:06:25] it says it fails to find vendor.service?
[16:06:26] I am just testing in labs now
[16:06:31] yeah
[16:06:39] I think that puppet doesn't find it in the catalog
[16:06:45] are you using wmf packages or debian ones?
[16:06:47] as it requires a File resource
[16:06:55] wmf packages
[16:07:08] then the packages provide that file
[16:07:17] that is strange
[16:07:34] please file a ticket with all the info
[16:07:39] this is interesting to me, as it should work
[16:07:46] sure, but doesn't require => File["/lib/systemd/system/${vendor}.service"] mean that /lib/systemd/system/${vendor}.service needs to be defined in puppet somewhere?
[16:08:25] yeah, maybe it dates from when we provided the systemd unit with puppet
[16:08:29] a long time ago
[16:08:33] and wasn't updated
[16:08:48] all right, thanks for the chat people, opening a task with a summary of what we discussed :)
[16:08:57] I will take care of it, elukey
[16:09:05] just please give us as much info as you can
[16:09:14] no rush, you guys have a lot on your plate these days :)
[16:09:42] there are other people in cloud using managed
[16:09:43] * marostegui happy to reach the same conclusion as jynus earlier with elukey, that the package provides that file and shouldn't be failing
[16:10:04] sorry, I didn't read the backlog
[16:10:11] No, that was between us
[16:10:19] You were busy enough with other things
[16:10:20] :)
[16:10:30] So we did some investigation beforehand
[16:13:13] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10aezell)
[16:16:44] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) Thanks for amending the task. I still have some questions. 1) Why do you want this change deployed only to testwiki? 2) The table creation isn't done by...
[16:26:10] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10aezell) 1. We'd like the change on all production databases but I know "all databases" isn't a valid answer for your work. However, I'm new and don't know all of t...
[16:26:25] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10aezell)
[16:34:24] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (10Marostegui) >>! In T204006#4574614, @aezell wrote: > 1. We'd like the change on all production databases but I know "all databases" isn't a valid answer for your w...
[16:39:59] 10DBA, 10Operations, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10mark)
[16:55:40] 10DBA, 10Operations, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10Joe) AIUI, the reason why we're not using MySQL (which would probably fit this storage model as well, if not better than cassandra) is just that we don't have libra...
[17:06:52] not find dependency File[/lib/systemd/system/mariadb.service] for Service[mariadb] at
[17:07:07] copy/paste failure, sorry
[17:07:09] :)
[17:12:39] 10DBA, 10Operations, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10Pchelolo) > don't have libraries and abstractions for accessing MySQL from our nodejs services. Is that correct? That's the easy part, node has great support for M...
[17:16:20] 10DBA, 10Analytics: mariadb::service and managed services don't play well on Stretch - https://phabricator.wikimedia.org/T204074 (10elukey) p:05Triage>03Normal
[17:17:09] created --^
[17:38:06] 10DBA, 10MediaWiki-extensions-Translate, 10Operations, 10Wikimedia-production-error: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Krinkle)
[17:55:48] 10DBA, 10Core-Platform-Team, 10Structured-Data-Commons, 10Wikidata, and 4 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (10daniel)
[22:10:38] 10DBA, 10JADE, 10Operations, 10Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight)
[22:46:53] 10DBA, 10JADE, 10Operations, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) I'm making some changes to the proposal, which I hope emphasize the role of Judgment pages to car...
[23:38:06] 10DBA, 10JADE, 10Operations, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) Here's a new proposal for the anatomy of a judgment (in this case, of a diff): ``` wikitext (main slot): n...