[02:22:04] DBA, MediaWiki-Database, Tracking: Duplicate key errors (tracking) - https://phabricator.wikimedia.org/T106854 (Krinkle)
[05:24:51] DBA, Performance, Security: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) - https://phabricator.wikimedia.org/T195578 (Marostegui) #security any comments about ^
[05:46:23] DBA, Operations, Patch-For-Review: Puppetize grants for mysql analytics servers - https://phabricator.wikimedia.org/T114476 (Marostegui) Open>Resolved I have renamed the file from `research-grants.sql.erb` to `analytics-grants.sql.erb` so we can have all the users that are actually active (T2...
[06:13:34] DBA, MediaWiki-extensions-Translate, Operations, Datacenter-Switchover-2018, Wikimedia-production-error: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (jcrespo)
[06:34:51] DBA, Operations, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (jcrespo) > If we had some other MySQL cluster that would be the best option 2.2GB of data is a ridiculously small amount of data, and it would fit comfortably in one...
[09:21:11] this is the basic check I am going to implement to check backups:
[09:21:21] section='s1'; dc='eqiad'; mysql.py -A -BN -h db1115 zarcillo -e "SELECT * FROM backups WHERE section='${section}' AND status='finished' AND host like '%.${dc}.wmnet' AND end_date IS NOT NULL AND end_date > now() - INTERVAL 9 day ORDER BY end_date DESC LIMIT 1"
[09:21:23] 116 dump.s1.2018-09-11--23-31-16 finished dbstore1001.eqiad.wmnet:3311 dbstore1001.eqiad.wmnet dump s1 2018-09-11 23:31:16 2018-09-12 01:59:36 139910495997
[09:24:06] so you will exclude the backups that do not match those criteria then?
[09:27:34] not sure what you mean- the check will do that query and give an alert if no rows are found
[09:30:10] I was thinking the other way around, that the alert would check status != 'finished' and alert if there are rows
[09:30:16] It is the same thing anyways :)
[09:32:07] no, we need to check the existence, not the non-existence of the opposite
[09:32:22] e.g. if the source host is down, it will not write anything to the db
[09:32:28] not creating any alert
[09:32:44] But that would be captured with the INTERVAL 9 days no?
[09:33:08] ok, I know what you mean
[09:33:18] you want to alert if a backup fails
[09:33:28] Yeah
[09:33:29] that is doable, but not what I want to do first
[09:33:41] Sure, I was just adding food for thought :)
[09:33:41] remember I said "this is my backup check"
[09:33:50] *basic
[09:33:56] correct!
[09:33:57] we can add other stuff
[09:34:20] e.g. send email or whatever when we know things have failed or are stuck
[09:34:40] but the basic health check is "we have fresh backups?"
[09:34:43] yep :)
[09:34:53] one step at a time :-)
[09:35:01] that's the key :)
[09:35:24] the other things I may not put on icinga
[09:35:29] but on a report page or email
[09:35:47] +1 to mail for now indeed
[09:56:48] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (daniel) >>! In T200297#4575766, @awight wrote: > I'm making some changes to the proposal, which I hope em...
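The basic check discussed between 09:21 and 09:35 (alert when no fresh finished backup exists, rather than when a failed one does) could be sketched roughly as follows. This is only an illustration, not the check that was actually deployed: it uses an in-memory SQLite table as a hypothetical, simplified stand-in for the zarcillo `backups` table, and the function name `check_fresh_backup`, the reduced column list, and the Icinga-style return codes are all assumptions.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical, simplified stand-in for the zarcillo `backups` table;
# only the columns referenced by the query in the log are modelled.
SCHEMA = """
CREATE TABLE backups (
    id INTEGER PRIMARY KEY,
    name TEXT,
    status TEXT,
    host TEXT,
    section TEXT,
    end_date TEXT
)
"""


def check_fresh_backup(conn, section, dc, now, max_age_days=9):
    """Return (exit_code, message): 0 = OK, 2 = CRITICAL (Icinga-style)."""
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    row = conn.execute(
        "SELECT name, end_date FROM backups "
        "WHERE section = ? AND status = 'finished' "
        "AND host LIKE ? AND end_date IS NOT NULL AND end_date > ? "
        "ORDER BY end_date DESC LIMIT 1",
        (section, f"%.{dc}.wmnet", cutoff),
    ).fetchone()
    if row is None:
        # No row at all: covers failed backups AND backups that never ran
        # (e.g. the source host was down and wrote nothing to the db).
        return 2, f"CRITICAL: no finished {section} backup in {dc} within {max_age_days} days"
    return 0, f"OK: latest {section} backup in {dc} is {row[0]} (finished {row[1]})"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO backups VALUES (116, 'dump.s1.2018-09-11--23-31-16', "
        "'finished', 'dbstore1001.eqiad.wmnet', 's1', '2018-09-12 01:59:36')"
    )
    now = datetime(2018, 9, 12, 9, 21, 21)
    print(check_fresh_backup(conn, "s1", "eqiad", now))
    print(check_fresh_backup(conn, "s1", "codfw", now))
```

Checking for the presence of a good row, as argued at 09:32, also catches the case where nothing was written at all; a check for `status != 'finished'` rows would stay silent in that situation.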
[10:15:24] DBA: Investigate using macbre/index-digest over WMF MW sql query workload - https://phabricator.wikimedia.org/T180158 (Marostegui) p:Triage>Low
[10:19:17] DBA: Investigate using macbre/index-digest over WMF MW sql query workload - https://phabricator.wikimedia.org/T180158 (jcrespo) This is interesting, and we know there are issues with mediawiki schema, however there are more pressing issues that are plain wrong- while this falls more on the optimization categ...
[10:23:23] DBA, Cloud-Services, Cloud-VPS: Database upgrade MariaDB 10: Metadata access in INFORMATION_SCHEMA causes complete blocks - https://phabricator.wikimedia.org/T71182 (Marostegui) Open>Resolved a:Springle I am going to close this as resolved with the creation of information_schema_p and the...
[10:28:06] great work on cleanup, manuel
[10:31:39] Thanks :-) I try to do it once a week so we can get rid of stuff and also get a more realistic view of what we are working on, and what looks like that will come next
[10:31:46] So we have a clearer dashboard
[10:36:41] DBA, Analytics: mariadb::service and managed services don't play well on Stretch - https://phabricator.wikimedia.org/T204074 (jcrespo) I think removing the require would suffice, this was a relic from when we setup the systemd unit with puppet, which we don't do anymore. If for some reason the systemd un...
[11:49:57] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (daniel)
[12:19:35] I just checked that all codfw masters are STATEMENT (not x1, of course)
[13:26:11] I see SHOW GLOBAL STATUS from prometheus running for a long time on some db2* hosts
[13:26:46] it is stuck
[13:26:52] oh yes
[13:26:55] i see it too
[13:26:55] on db2068
[13:26:59] and db2054
[13:27:00] and 54, several runs
[13:27:21] and those two are from s7
[13:27:26] s7 replicas
[13:27:37] I am going to kill tendril and prometheus processes
[13:27:40] yeah
[13:27:44] cause those are lagging
[13:28:02] very strange that those two are the only ones and they are s7
[13:28:56] updating | DELETE /* MessageGroupStats::clearGroup */ FROM `translate_groupstats` W
[13:29:47] could that be related to the disablement?
[13:30:09] could be
[13:30:17] that DELETE is HUGE
[13:30:20] but a disable that creates large drops?
[13:30:36] maybe that is the deletion triggered by Hauskatze?
[13:30:58] it is an insane delete
[13:31:32] we cannot do anything but wait
[13:31:41] is it causing issues on eqiad?
[13:31:51] nop
[13:31:55] No slaves lagging at least
[13:34:05] is it not affecting the other hosts?
[13:34:09] on s7?
[13:34:13] marostegui: do you mean the edit I did to fix the Meta:Example link?
[13:34:14] no
[13:34:18] only those 2
[13:34:25] Hauskatze: Did you delete the page in the end?
[13:34:33] marostegui: I deleted nix.
[13:34:50] what is special about those hosts?
[13:35:05] jynus: I was checking if those were the rc ones, but they aren't
[13:35:10] or maybe the Meta:Example translations (only 4 pages) marostegui
[13:35:30] Hauskatze: Don't know really, I was wondering because it is a DELETE that kinda matches the times, but who knows
[13:35:50] Hauskatze: https://phabricator.wikimedia.org/P7535
[13:36:22] marostegui: given the MessageGroupStats thing above, it was either me removing Meta:Example from translation -or- deleting translations posted at Meta:Example
[13:36:24] marostegui: I think it may have forced a metadata update
[13:36:38] that is why it is blocking prometheus and tendril
[13:37:02] yep
[13:37:27] (side note: if you think the Translate extension is not DB-friendly, please file tasks so the language team can enhance the tool)
[13:37:49] I am disabling the scheduler
[13:37:50] Hauskatze: Will do! Have to check if that DELETE can be done in some other way
[13:38:05] on db2068
[13:38:18] jynus: Should we also ease replication options to reduce some load and check if the delete finishes faster?
[13:38:27] marostegui: perfect :) --> (siesta time)
[13:38:28] it is a single query
[13:38:31] and put less stress on the disk
[13:38:35] we cannot do much
[13:38:37] Hauskatze: enjoy!
[13:39:03] maybe prepare a depool patch?
[13:39:21] sounds reasonable
[13:39:23] I will do it
[13:39:42] can you check pt-diff to see if those hosts have some different running options?
[14:10:01] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (Halfak) With the wikitext slot, we won't know which note relates to which judgement. This is like having one big "not...
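The step at 13:27 (killing the stuck prometheus and tendril queries while the large DELETE runs) could be approximated with a small helper that picks long-running monitoring queries out of a processlist and emits `KILL` statements for review. This is a hedged sketch only: the account names `prometheus` and `tendril`, the 60-second threshold, and the function name `kill_statements` are assumptions for illustration, and the input is expected as dictionaries keyed like the columns of `SHOW PROCESSLIST`.

```python
def kill_statements(processlist, users=("prometheus", "tendril"), min_seconds=60):
    """Build KILL statements for long-running queries owned by monitoring users.

    `processlist` is an iterable of dicts keyed like SHOW PROCESSLIST columns
    (Id, User, Command, Time, Info). The user names and the threshold are
    illustrative assumptions, not values taken from the log.
    """
    statements = []
    for proc in processlist:
        if (
            proc["User"] in users
            and proc["Command"] == "Query"  # skip idle (Sleep) connections
            and proc["Time"] >= min_seconds
        ):
            statements.append(f"KILL {int(proc['Id'])};")
    return statements


if __name__ == "__main__":
    sample = [
        {"Id": 101, "User": "prometheus", "Command": "Query", "Time": 900,
         "Info": "SHOW GLOBAL STATUS"},
        {"Id": 102, "User": "wikiuser", "Command": "Query", "Time": 900,
         "Info": "DELETE /* MessageGroupStats::clearGroup */ ..."},
        {"Id": 103, "User": "tendril", "Command": "Sleep", "Time": 900,
         "Info": None},
    ]
    for stmt in kill_statements(sample):
        print(stmt)
```

The helper only prints statements instead of executing them, so an operator can review the list first; the application DELETE itself is deliberately left alone, matching the "we cannot do anything but wait" conclusion in the log.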
[14:34:33] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (Halfak) For clarity, here's a rough version of the endorsements proposal that I'd originally put together about a year...
[14:37:43] DBA: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 (Marostegui) p:Triage>High
[15:10:54] DBA, Datacenter-Switchover-2018: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 (jcrespo)
[16:49:50] DBA, Operations, Epic, Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (akosiaris)
[17:01:00] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) >>! In T202596#4577106, @Halfak wrote: > With the wikitext slot, we won't know which note relates to which jud...
[18:02:11] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (Marostegui) With `course_token` you mean that column on the `ep_courses` table? If that is the case, there is no need to play around with the vie...
[18:03:35] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) >>! In T200297#4576584, @daniel wrote: > Note that it's blocked on {T204112}. That's not particul...
[18:11:12] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Another change which I'll document here is that I'm dropping the use cases for "write-only" workf...
[18:59:27] DBA, Core-Platform-Team, Structured-Data-Commons, Wikidata, and 4 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (daniel)
[19:10:44] Blocked-on-schema-change, DBA, Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (aezell)
[19:14:20] Blocked-on-schema-change, DBA, Anti-Harassment: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (aezell) @Marostegui I've updated the description to include `all.dblist` and the team will SWAT the create table that is required for this project.
[20:05:46] DBA, Datacenter-Switchover-2018: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 (Marostegui) We could probably reclone one of these hosts (for example db2054) from an eqiad slave, and then move it under codfw master. That way we don't have to depool an active codfw s7 slave, as tha...
[20:21:04] DBA, Datacenter-Switchover-2018: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 (Banyek) I can work on this if somebody shows me how to clone hosts (normally I'd use xtrabackup -> tar -> netcat -> netcat -> tar but I think that is a no-no with mariadb)
[20:40:24] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (Halfak) > The main thing driving us to that conclusion was that a "notes" field should be shared between damaging and...
[20:59:56] DBA, Datacenter-Switchover-2018: Reclone db2054 and db2068 - https://phabricator.wikimedia.org/T204127 (jcrespo) >>! In T204127#4578185, @Marostegui wrote: > We could probably reclone one of these hosts (for example db2054) from an eqiad slave, and then move it under codfw master. That way we don't have...
[22:06:39] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (Reedy) Easy done then :)
[22:29:44] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) with the migration to notedb accounts and changes have been removed from the...
[22:57:02] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Dzahn) You are saying we won't need any mysql/mariadb for Gerrit anymore?
[22:58:14] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) Yep, but currently 2.x will still require a db just 2.15 does not read change...
[23:02:31] DBA, Operations, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) quarry doesn't use the mysql module anymore since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/454481/ was merged....
[23:08:52] DBA, Growth-Team, MediaWiki-extensions-CentralAuth, Notifications, and 2 others: CentralAuthCreateLocalAccountJob failing on meta due to Echo deadlocks - https://phabricator.wikimedia.org/T121161 (Krinkle) Open>declined Not seen in Logstash for at least 30 days.
[23:09:04] DBA, Growth-Team, MediaWiki-extensions-CentralAuth, Notifications, Wikimedia-production-error: CentralAuthCreateLocalAccountJob failing on meta due to Echo deadlocks - https://phabricator.wikimedia.org/T121161 (Krinkle)