[02:37:48] DBA, Operations, Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#2658084 (Dzahn) p:Triage>Normal
[02:39:52] DBA, Operations, ops-codfw: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2658103 (Dzahn) p:Triage>Normal
[06:21:34] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658219 (Marostegui) After the testing with db1082, these are the real sizes I got after all the alters: ``` root@db1082:/srv/sqldata# du -hd 1 253G ./dewiki 66M ./mysql 518G ./wikidatawiki 120K ./ops 16K...
[07:58:11] DBA: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2658320 (Marostegui) I have taken a backup of /etc/mysql and /opt at `neodymium:/home/marostegui/dbstore2001.tar.gz` @jcrespo you might want to check your /home there just in case you need to save something?
[08:05:30] DBA, Operations, ops-codfw: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2658330 (Marostegui) Hey @PPaul just adding you here to make sure this doesn't get missed. Thanks!
[08:21:16] BTW, the user is papaul, you added a random user :-)
[08:24:34] XDDD
[08:24:36] damn
[08:25:03] fixed
[08:25:36] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2658359 (jcrespo) Please do not do it in random hosts. I will kill it now because it has been run without proper permission. Talk to me, or...
[08:34:22] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2658371 (jcrespo) You also told me under what it is now a false pretense, that the script would not generate any local data (only another s...
[08:43:25] I've just realized after almost a month of applying the schema change and almost finished
[08:43:34] that I missed a table everywhere
[08:44:14] so I have to start from the beginning again
[08:49:29] jynus: No way!!! :(
[08:49:44] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658390 (jcrespo) Can you clarify the duplicate keys issue? when did you get that, doing what?
[09:06:58] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658417 (Marostegui) I was testing the following ALTER ``` alter table $i engine=INNODB,FORCE; ``` The following table complained: ``` Table: user_properties Create Table: CREATE TABLE `user_properties` ( `u...
[09:11:28] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658423 (jcrespo) Let's check with barracuda compressed and innodb_long_prefix. And the table should have primary key, `(up_user, up_property)`; we should report that as a bug. Also, let's do a full shard check T10445...
[09:14:22] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658427 (jcrespo) `user_properties` should be a small-ish table, we could literally export it on csv and perform a diff, as pt-table-checksum will not work here.
[09:15:38] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2658428 (Marostegui) Sounds like a safe idea. I will check the other 2-3 tables that failed too and will report back here. Also, testing barracuda and innodb_long_prefix sounds good to me.
[09:16:32] aside from checking the integrity of db1081
[09:16:56] do not waste much time on the defragmenting part, that is not a priority
[09:17:25] having correct data >>>> having more disk space
[09:18:01] I do not think there is a data corruption problem
[09:18:13] Sure
[09:18:14] I think that data upgraded from lower versions
[09:18:30] can read but not write under certain conditions
[09:18:37] yeah, I don't think it is corrupted, but I wouldn't feel comfortable using that host as a source to clone the rest :)
[09:18:47] which was exactly
[09:18:54] what broke labsdb1004
[09:19:01] 's replication
[09:19:02] yes
[09:19:18] that is why I mentioned to have a clean logical export
[09:19:28] then use that to reconstruct the others
[09:19:31] Yep
[09:19:53] I will check that long ticket you mention to do a data check on that shard
[09:20:17] well, the issue is there is no script yet
[09:20:20] it is WIP
[09:20:45] what is worse, a OK-less table is by definition, impossible to check properly
[09:20:53] s/OK/PK/
[09:21:25] yep, that is worrying
[09:21:31] Where should I report that?
[09:21:36] by fortune, all tables are either derived ones or not so important
[09:21:36] Any hint?
[09:22:01] if it was page or revision, it would be more worrying
[09:22:14] marostegui, phabricator? XD
[09:22:25] I cannot be more specific, sadly
[09:22:38] No worries, just asking because maybe it sounded familiar to you or something
[09:22:45] mediawiki-general-or-unknown
[09:22:56] well, I think there is a ticket already open
[09:23:00] let me search
[09:23:24] https://phabricator.wikimedia.org/T17441
[09:23:53] good!!
thanks :)
[09:24:45] I would comment there the issue
[09:25:01] sounds good
[09:25:03] thanks
[09:25:03] as it is literally creating maintenance problems and blocking a repool
[09:25:48] yeah, it is not nice to not have a PK :(
[09:26:05] https://phabricator.wikimedia.org/T17441#1420166
[09:27:08] I will do the same for this shard (S5)
[09:48:14] Blocked-on-schema-change, Community-Tech, Patch-For-Review, Schema-change, WMF-deploy-2016-09-13_(1.28.0-wmf.19): Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951#2658525 (Marostegui) This ALTER has been complet...
[09:49:30] ^this is great, you did a last check
[09:49:40] which I usually do before resolving
[09:55:19] Yeah, I wanted to make sure I didn't leave any host :)
[09:56:02] marostegui, can we hangout, I would like to fully catch up after being away for one day
[09:56:12] yeah
[09:56:23] Whenever you are ready
[11:34:43] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2658645 (jcrespo) Yes, ideally everything would be on InnoDB, we probably can only do it on a subset of tables for now. These should be ok in size...
[12:48:48] hey
[12:49:00] looks like we have about 30 databases in codfw that are 5 years old or nearly so
[12:49:08] and due to be refreshed this year
[12:49:14] when approximately would you like to do that?
[12:51:25] DBA, MediaWiki-Database, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#197454 (Marostegui) While doing testing I found a couple of tables without PK in S5, wikidatawiki and dewiki. This is the list of tables witho...
[13:00:19] mark, the only thing I would like to push for late is s8
[13:00:41] that can be done any time next year (after the failover, maybe)
[13:01:02] march-may?
[13:02:50] dc failover?
[13:03:17] we talked about program one for early in the year, but needs discussion
[13:03:38] Ah roger that. Just getting context on what you meant. Thanks :)
[13:03:55] marostegui, https://wikitech.wikimedia.org/wiki/Datacenter_failover
[13:04:04] what do you mean by that jynus, "would like to push for late"?
[13:04:15] thanks!
[13:04:22] late in the fiscal year, mark
[13:04:36] so the rest you would like earlier, or?
[13:04:40] yes
[13:04:44] sure
[13:04:45] when? :)
[13:04:48] roughly, by quarter
[13:04:48] regular upgrade is more important
[13:05:05] I just said it above
[13:05:47] that is purchase it on 3 install on 4 quarter
[13:06:01] wikidata purchase on 4 install on 4+
[13:06:08] ok
[13:06:15] so those 30 codfw databases, purchase in Q3, install in Q4?
[13:06:18] dbstore and stuff this quarter
[13:06:23] that begins now
[13:06:34] mark, yes
[13:06:49] and anything else the end of Q2
[13:07:10] being what?
[13:07:15] I hoped we could have some for misc on eqiad
[13:07:32] ok
[13:07:46] not necessarily new ones for misc
[13:08:05] but rolling new ones for mediawiki, not-so-old ones for misc
[13:10:08] let me write a quick 10-line proposal so you see the general idea
[13:10:18] of what and when
[13:10:25] that would be helpful, thanks :)
[13:23:25] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2658770 (Marostegui) So far I have only converted: `S1: enwiki/user_groups` It all looked fine but I do not want to do more tables at the same t...
[13:57:17] DBA, Wikimedia-Site-requests: create a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2658829 (Dzahn)
[14:00:18] DBA, Wikimedia-Site-requests: create a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2658833 (Dzahn) Added DBA to delete the old database per 1.
on T126832#2025865
[14:03:41] DBA, Wikimedia-Site-requests: create a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2658844 (jcrespo) Can someone make a separate task, I do not know that #DBA s are supposed to do here? Dropping a single production database will not cleanup things for ptwikimedia, there are at...
[14:08:44] DBA, Wikimedia-Site-requests: create a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#2658862 (Dzahn) It seems the problem is nobody knows what needs to be done to unblock it, since the options were either "rename database" or "delete database" but both have been rejected for tec...
[14:10:25] hey yall, it looks like the eventlogging replication script has not been inserting events for a while
[14:10:55] some tables are a few hours behind, one of them is 24 hours behind
[14:11:25] not sure if that's expected or if I should file a task
[14:13:45] milimetric, which host?
[14:15:31] milimetric, can you tell me a table you would expect to be always receiving events?
[14:16:13] jynus: select max(timestamp) from Edit_13457736;
[14:16:30] (that's the latest editing schema, which gets events all the time)
[14:16:50] and that one's showing 20160922074957 for the replica
[14:17:34] maybe there is a delay because a lot of events were buffered
[14:18:25] the buffer in our code?
That one maxes out at 5 minutes / 3000 events
[14:18:50] no, on the replica itself
[14:18:53] elukey: I was just saying 20160922074957 is the last timestamp for Edit_13457736 which I would expect to get events all the time
[14:19:39] I am not super expert about EL though :(
[14:20:02] I am checking, the replication is working
[14:21:03] but if things were down for 1 day I wouldn't be surprised to find "lag" on restart
[14:22:33] I will wait for it to arrive at Edit Table
[14:22:57] it is now on it
[14:23:03] it should have been updated
[14:23:13] I am looking at db1047
[14:23:34] it ran and should have inserted the missing records you mean?
[14:23:37] It says 20160922141158
[14:23:41] yeah
[14:23:48] which is ok, right?
[14:23:53] yeah, that's great
[14:23:59] but the replica says 20160922074957 still
[14:24:17] ok, so for now I would say if you are in an emergency, use analytics-slave
[14:24:22] I will now check analytics-store
[14:24:34] no emergency, just wondering if it's normal lag or exceptional enough to raise to you
[14:25:17] Oh, "Access denied for user"
[14:25:26] something is wrong there
[14:25:36] sounds like it :)
[14:25:47] probably I know what it is
[14:26:01] we had some grant security issues
[14:26:14] the day before yesterday
[14:26:29] I may have gone too far
[14:26:48] it kind of fixes itself
[14:26:57] but I will give it a bump
[14:27:06] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659030 (Halfak)
[14:28:13] in fact, it was a puppet issue of how it was done
[14:28:47] I see, ok, thanks for looking into it jynus
[14:29:40] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2627432 (Halfak) Assigning this to @Ladsgroup because it looks like the last thing to do is to run the maintenance
script. We can...
[14:29:49] milimetric, it should be fixed now
[14:29:59] it may take some time to catch up
[14:30:08] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659065 (Halfak)
[14:30:14] the replication is half-done
[14:30:24] we need to improve it and make it more resilient
[14:31:17] it is now replicating correctly and it will backfill automatically, too
[14:31:46] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2627432 (Halfak) p:Triage>High
[14:31:49] milimetric, we have 2 servers so in case one breaks, the other continues working :-)
[14:31:57] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog, User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659070 (Halfak) a:Ladsgroup
[14:32:03] but thanks for the report
[14:32:51] cool, thx and I'll keep in mind to only bug you if both break
[14:32:59] no no
[14:33:07] elukey: lemme know if you wanna chat about this and how it works
[14:38:52] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog, User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659094 (jcrespo) Can I have a look at it before it is run? It is very easy to create lag by accident,...
[14:42:39] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service, Patch-For-Review, Performance: hidenondamaging=1 query is extremely slow on enwiki - https://phabricator.wikimedia.org/T146111#2650989 (Halfak)
[14:49:41] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659145 (GWicke) I hear your concerns, and am happy to wait until the migrations are done. Let me also clarify some assumptions. I did che...
[14:56:59] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659184 (jcrespo) So, let me first suggest using terbium as the primary maintenance node. That will make sure that worst case scenario, onl...
[15:02:49] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659213 (GWicke) > Third, how long do you think it will take, can you run them everywhere except s1 (where the alter is currently running)...
[15:06:24] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659217 (jcrespo) Note that 9 hours later, the script was still running (!). Please add it to the calendar to "book" the time before anyon...
[15:17:45] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659247 (GWicke) > Note that 9 hours later, the script was still running (!). Looking at the timestamps, basically all that time was spent...
[15:22:20] If you are still around, a quick look at https://gerrit.wikimedia.org/r/301076 would be nice
[15:22:52] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog, User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659254 (Halfak) @jcrespo yes please. I'm confirming now what should be reviewed.
[15:26:47] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659276 (GWicke) Sooo, if running this query counts as a deploy, should we also wait until after next week?
[15:29:14] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659281 (jcrespo) Releng are the kings here, ask them. :-) I would be ok with it, as technically it is not a deployment? I do not know. Pr...
[16:00:49] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659340 (greg) Re timing of scripts (ie: the policy says anything over 1-ish hour): If you will be running multiple runs of the same script...
[16:02:58] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659355 (jcrespo) @Greg, yes, thinking it better, I won't be monitoring things, so it is the better option. It happens that I am easily con...
[16:55:00] Blocked-on-schema-change, Community-Tech, Patch-For-Review, Schema-change, WMF-deploy-2016-09-13_(1.28.0-wmf.19): Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951#2659481 (kaldari) @Marostegui: This can be close...
[17:26:39] DBA, Operations, ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2659560 (jcrespo) According to lifecycle logs "System is turning off." is the cause of the issue. No logs. No signs of a crash. Logs continue as usual: ``` Sep 19 01:09:39 db1061 sshd[47512]: Set /proc...
[17:27:11] DBA, Operations, ops-eqiad: Investigate db1061 crash - https://phabricator.wikimedia.org/T146018#2659561 (jcrespo) Open>Resolved I think there is not much left to do here, except wait if it happens again.
[17:55:51] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service-Backlog, User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2659682 (Halfak) @jcrespo see T145503#2659669 which references https://gerrit.wikimedia.org/r/#/c/312286/
[18:50:38] DBA, RESTBase-Cassandra, Services-next, Patch-For-Review: Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2659972 (GWicke) I see where you are coming from & generally share the "better safe than sorry" approach. We will wait another week. That...
[21:30:32] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2660564 (greg) I went ahead and [[ https://www.mediawiki.org/w/ind...