[06:10:03] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[06:11:09] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[06:35:46] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[08:39:28] DBA, Analytics, Analytics-Kanban: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589 (elukey) p:Triage→High
[08:41:11] DBA, Analytics, Analytics-Kanban: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589 (elukey)
[08:42:01] DBA, Analytics, Analytics-Kanban: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589 (elukey)
[08:42:44] DBA, Analytics, Analytics-Kanban: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589 (elukey)
[08:43:55] DBA, Analytics, Analytics-Kanban: Migrate users to dbstore100[3-5] - https://phabricator.wikimedia.org/T215589 (elukey) @leila @Halfak Hi! The new dbstore100[3-5] hosts are ready, so I'd ask your teams to start using those and see what's missing/not-working/etc.. Let me know!
[09:17:42] hello people
[09:18:18] I have a thought about DNS records for the new dbstore hosts, just wanted to know what you guys think about it
[09:19:04] I am wondering now if some SRV records could ease the use of the dbstores, since I already got some people confused by the ports etc..
[09:19:06] something like
[09:20:11] _s1-analytics._tcp.eqiad.wmnet IN SRV 0 10 3321 s1-analytics-replica.eqiad.wmnet
[09:20:14] etc..
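A client could read the port and target out of an SRV record like the one above instead of hardcoding them. The following is only a minimal sketch: it parses the record text as written in the log and does not query DNS. `SrvRecord` and `parse_srv_line` are hypothetical names introduced for illustration, and a real client would obtain the record from a resolver rather than from a string.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    """Fields of one SRV record (hypothetical helper type)."""
    name: str
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_line(line: str) -> SrvRecord:
    """Parse '<name> IN SRV <priority> <weight> <port> <target>'."""
    fields = line.split()
    if len(fields) != 7 or fields[1:3] != ["IN", "SRV"]:
        raise ValueError(f"not an SRV record line: {line!r}")
    name, _, _, priority, weight, port, target = fields
    # A trailing dot (fully qualified form) is dropped from the target.
    return SrvRecord(name, int(priority), int(weight), int(port), target.rstrip("."))


# The record text is copied from the log; nothing is looked up over the network.
record = parse_srv_line(
    "_s1-analytics._tcp.eqiad.wmnet IN SRV 0 10 3321 s1-analytics-replica.eqiad.wmnet"
)
print(f"connect to {record.target}:{record.port}")
```

With this shape, the section name (s1) is the only thing a caller needs to know; the port comes from the record.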
[09:20:51] I don't have a strong opinion either way; whatever you (and the final users) find easier and more convenient
[09:21:19] I do see how that can be easier and less confusing indeed
[09:21:27] this is more for code, since I'd need to write the glue to replace the usage of analytics-store.eqiad.wmnet
[09:21:42] basically a DNS query would be sufficient
[09:21:49] rather than hardcoding the logic
[09:22:04] yeah
[09:23:07] I am going to prepare a code change, so I can ask the SRE team if it would be a legit usage
[09:23:10] thanks!
[09:23:16] thank you!
[09:35:53] Hi there! jynus would you please have a minute?
[09:36:14] sure
[09:36:31] I noticed a new slow query on enwiki and I'm wondering if it has to do with the task you helped with
[09:36:43] This query is very simple: SELECT afl_id FROM `abuse_filter_log` WHERE afl_filter = '9' ORDER BY afl_timestamp DESC LIMIT 51
[09:36:55] Apparently the filter_timestamp index is ignored, and it takes one minute to execute
[09:37:05] Could you please check if enwiki has the filter_timestamp index in place?
[09:37:19] Is the query literally that^
[09:37:33] Not really, but this is the minimum faulty version
[09:37:45] yeah, I mean mostly the constant
[09:38:00] Yes
[09:38:08] Here is the full version https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=enwik&qmode=eq&query=abuse_filter_log&hours=14
[09:38:30] EXPLAINing on quarry says no index is used
[09:38:42] explain on production says it will use afl_timestamp
[09:39:26] Like it says on quarry
[09:39:28] I don't see any filter, timestamp index
[09:39:33] But it should use filter_timestamp
[09:39:36] Ack
[09:39:47] yeah, but it doesn't exist :-)
[09:39:57] I can check on other hosts
[09:40:06] Uhm, actually I wonder if this is https://phabricator.wikimedia.org/T187295
[09:40:39] please link me to the latest on-paper release
[09:40:49] of the sql structure
[09:40:52] and I will compare it
[09:41:17] https://phabricator.wikimedia.org/diffusion/EABF/browse/master/abusefilter.tables.sql$34-61
[09:42:05] please, please, please give a name to indexes in the future
[09:42:20] or bad things could happen on deletion
[09:42:30] I mean explicit names
[09:43:11] yeah, production looks nothing like that
[09:43:12] Which indexes? Am I missing something ...?
[09:43:17] Noice
[09:43:32] KEY (afl_timestamp),
[09:43:41] yeah, it will give a default name
[09:43:55] until it does not because of some complex migration
[09:44:19] Ah I see
[09:44:27] well, there is a ticket for that, it will be fixed
[09:44:28] Those indexes are pretty old, though
[09:44:55] Alright, it's enough to confirm that it's the same issue, thanks again!
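The two points in this exchange (give indexes explicit names, and check whether the expected index actually exists and is used) can be sketched as follows. This is an illustration under loud assumptions: it uses an in-memory SQLite database as a stand-in, not MariaDB, the three-column table is a cut-down hypothetical version of `abuse_filter_log`, and the `index_names` and `query_plan` helpers are invented for the example. Production query plans can differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Cut-down, hypothetical version of the table (assumption, not the real schema).
    CREATE TABLE abuse_filter_log (
        afl_id INTEGER PRIMARY KEY,
        afl_filter TEXT NOT NULL,
        afl_timestamp TEXT NOT NULL
    );
    -- Explicitly named index, so later schema changes can refer to it by name.
    CREATE INDEX filter_timestamp
        ON abuse_filter_log (afl_filter, afl_timestamp);
    """
)


def index_names(db: sqlite3.Connection, table: str) -> list:
    """List the indexes defined on a table, i.e. 'is the index in place?'."""
    rows = db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    )
    return sorted(row[0] for row in rows)


def query_plan(db: sqlite3.Connection, sql: str) -> list:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]


slow_query = (
    "SELECT afl_id FROM abuse_filter_log "
    "WHERE afl_filter = '9' ORDER BY afl_timestamp DESC LIMIT 51"
)
print(index_names(conn, "abuse_filter_log"))
print(query_plan(conn, slow_query))
```

Comparing the list of index names against the schema file is the manual version of the drift check discussed in the log.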
[09:46:40] Daimona: most of those issues are blocked on https://phabricator.wikimedia.org/T104459
[09:47:02] sadly, in the past, people deployed code without deploying schema changes
[09:47:12] and vice versa, and that led to the current state
[09:47:23] it took us 2 years to fix the revision table
[09:47:41] Ouch, that's bad
[09:47:57] we need a metadata database to check the differences
[09:48:13] which is a todo, but there is only so much we can do at a time
[09:48:59] I see there's a lot to do
[09:49:09] Blocked-on-schema-change, DBA, AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (jcrespo)
[09:49:13] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:53:28] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:53:30] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (jcrespo)
[09:53:44] DBA, Data-Services: Discrepancies with logging table on different wikis - https://phabricator.wikimedia.org/T71127 (jcrespo)
[09:53:52] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:56:57] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:57:19] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:57:45] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[09:57:55] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:13:59] marostegui: I said this because I almost do- remember not to restart one of those hosts that have multiple mysqls without depooling all
[10:14:05] do you know what saved me?
[10:14:15] unmounting /srv and the &&
[10:14:46] I always run a ps aux before rebooting, I learned that the hard way years ago haha
[10:14:50] so it is now kinda automatic
[10:14:58] I do pgrep mysqld
[10:15:06] but I skipped it this time
[10:15:22] umount saved you haha
[10:15:24] the unmounting /srv failed and that prevented the issue
[10:16:19] we could automate the reboot, especially when dynamic configuration is in place
[10:20:37] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:20:39] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (jcrespo)
[10:21:12] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:21:15] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (jcrespo)
[10:21:28] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:21:31] DBA, Patch-For-Review: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 (jcrespo)
[10:21:51] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:21:57] DBA, Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (jcrespo)
[10:22:13] DBA, Datasets-General-or-Unknown, Patch-For-Review, Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (jcrespo)
[10:22:17] DBA, Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (jcrespo)
[10:22:26] you can ignore the noise, it is related to T104459
[10:22:26] T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459
[10:22:31] haha yeah
[10:22:32] I see
[10:22:37] tasks should depend on it
[10:22:43] not the other way around
[10:23:11] adding automation is not blocked by adding an index, adding an index is blocked by automation
[10:23:20] even if we can also fix those manually
[10:23:55] and inventory also relates to backup validation
[10:24:24] the parent/subtask is confusing
[10:25:22] it is the naming
[10:25:30] and the duality of bug vs task
[10:25:36] a bug is blocked by something
[10:25:54] a task has certain subtasks
[10:26:25] but a bug doesn't really have subtasks
[10:27:08] yes, it is the naming that makes it confusing
[11:03:41] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[11:04:06] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[12:34:07] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (MoritzMuehlenhoff) The server went down at 12:16, with a number of memory errors logged in SEL: ` ------------------------------------------------------------------------------- Record:...
[13:25:17] jynus: should we raise priority for db1114 ticket?
[13:25:43] pff, I don't think a firmware upgrade will fix that, as moritz said
[13:26:00] No, I didn't mean the upgrade, but the whole thing
[13:26:03] Contacting dell etc
[13:26:10] Or at least swapping memory DIMMs
[13:26:12] sure, I am just
[13:26:15] like
[13:26:20] mmm
[13:26:49] I think what cmjohnson1 normally does is swap the DIMMs, so if it crashes again and logs errors, he can contact Dell and get them replaced
[13:26:58] not against any of that
[13:27:08] just that I think we need more drastic changes
[13:27:17] I would propose to switch roles with db1118
[13:27:25] +1
[13:27:31] we cannot just get a crash on s1 every now and then
[13:27:32] but we still need to do the troubleshooting :)
[13:27:41] yes, I am just thinking beyond that
[13:27:46] ok, let's do that
[13:27:58] we swap roles
[13:28:03] replace db1114 with db1118 and also work with chris to get this handled from a physical point of view
[13:28:08] then we do debugging without needing to depool
[13:28:23] do you want me to reimage db1118?
[13:28:36] and apply all the puppet changes etc?
[13:30:48] so if you don't mind I would like to do it
[13:30:57] to test the backup system more
[13:31:01] sure! I was trying to offload stuff from you :)
[13:31:08] I will leave it to you!
[13:31:12] I should be doing backups!
[13:31:17] haha
[13:31:27] all yours!
[13:34:03] can I ask you to do what I was going to do instead?
[13:34:14] sure!
[13:34:16] what is it
[13:34:21] file a ticket as a followup of "mw bad, spof, help"
[13:34:30] the LB didn't work?!
[13:34:44] with "thanks, but now we broke logstash"
[13:34:50] hahah
[13:34:59] ask if you need details
[13:35:21] but basically kafka/elastic overloaded due to the 300K logs per minute
[13:35:27] lovely
[13:35:43] I will create a ticket and we can fill it out and then add the appropriate tags
[13:35:48] so there is probably something that could be done either at mw or infra side
[13:36:28] add jijiki as she was monitoring the situation
[13:36:33] will do
[13:37:01] filippo too, as he probably was the person that implemented the glue, even if only to provide feedback
[13:37:36] yup
[13:53:01] Going to downtime db1114
[13:54:15] marostegui: I did disable alert IIRC
[13:54:29] yes, I did
[13:54:34] yeah, notifications are disabled, I am going to downtime it to avoid it showing up on icinga UI
[13:54:40] ok to me
[13:54:48] I feared I hadn't
[13:55:04] you did, I am also going to commit it on puppet, so it is consistent
[13:55:30] if you want to see something there now, it is the moment
[13:56:59] https://gerrit.wikimedia.org/r/489210
[15:11:09] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1118.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20190208151...
[15:28:38] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1118.eqiad.wmnet'] ` and were **ALL** successful.
[15:44:37] ./wmfmariadbpy/wmfmariadbpy/transfer.py --type xtrabackup --no-encrypt dbstore1001.eqiad.wmnet:/run/mysqld/mysqld.s1.sock db1118.eqiad.wmnet:/srv/sqladata
[15:44:44] ERROR: The specified target path /srv/sqladata doesn't exist on db1118.eqiad.wmnet
[16:07:13] DBA, Core Platform Team (MCR), Core Platform Team Backlog (Later), Multi-Content-Revisions (Tech Debt), Schema-change: Once MCR is deployed, drop the rev_text_id, rev_content_model, and rev_content_format fields to be dropped from revision - https://phabricator.wikimedia.org/T184615 (daniel)
[16:07:59] DBA, Core Platform Team (MCR), Core Platform Team Backlog (Later), Multi-Content-Revisions (Tech Debt), Schema-change: Once MCR is deployed, drop the rev_text_id, rev_content_model, and rev_content_format fields to be dropped from revision - https://phabricator.wikimedia.org/T184615 (daniel)
[16:09:10] ^that saved me from wasting 2 hours
[16:10:25] because I am copying from dbstore1001, it may take up to 2 hours (very slow disk reads)
[16:10:37] nice!
[16:10:49] DBA, SDC Engineering, Wikidata, Core Platform Team (MCR), and 5 others: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 (daniel) Open→Resolved a:daniel This is done. We still have the old fields in the database, and we still write to them. Changing that is tracked...
[17:38:41] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) Transfer of 1h and 20m, probably sped up because I stopped replication (avoiding to replay many changes).
[18:14:53] DBA, Cloud-VPS, MediaWiki-Commenting: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas - https://phabricator.wikimedia.org/T166798 (Bstorm) Open→Invalid This is already done, much less decided. We have decided that providing back-compat...
[18:35:26] DBA, MediaWiki-Database, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 (Krinkle)
[18:50:52] DBA, Operations, ops-eqiad, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) Except for the above 3 patches, db1118 should be ready to go (not done so late in the week for obvious reasons).
[18:58:36] DBA, Cloud-VPS, MediaWiki-Commenting: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas - https://phabricator.wikimedia.org/T166798 (Anomie) Invalid→Resolved I hope you don't mind, I'm going to change this from "invalid" to "resolved". T...
[18:59:28] DBA, Cloud-VPS, MediaWiki-Commenting: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas - https://phabricator.wikimedia.org/T166798 (Bstorm) Fair enough :)
[20:32:33] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Ottomata)
[20:45:32] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Ottomata)
[20:46:49] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Ottomata) a:Ottomata→Dzahn Thanks Daniel! The Analytics usages are gone. I'm assigning...
[22:27:35] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[22:28:15] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn)
[22:30:19] DBA, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) >>! In T162070#4939472, @Ottomata wrote: > Thanks Daniel! The Analytics usages are gone....