[04:56:04] DBA: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Marostegui) Nothing on HW logs that could indicate something is wrong with storage or similar.
[04:58:32] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) >>! In T202764#4590949, @Smalyshev wrote: > I am a bit confused by now - is the original problem because recen...
[05:19:02] DBA: db1118 crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Marostegui) I started an alter table early in the morning on eqiad master for externallinks (130G) and recentchanges, but that went thru at around lunch time, so I don't think it could be the cause of the crash, but maybe...
[05:26:24] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) @Smalyshev - the indexes have been removed from the API hosts. The queries on those two servers (db2080 and db...
[05:41:57] DBA: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (Marostegui) p: Triage>High Triaging this as high as it is the backup source. I think we should just reimport s2 there.
[06:13:21] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) The API requests for recentchanges now seem to be faster, but I still get exceptions in the log :( I also get a...
[06:17:55] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) >>! In T202764#4592236, @Smalyshev wrote: > The API requests for recentchanges now seem to be faster, but I st...
[06:25:20] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) @Smalyshev eqiad and codfw are not the same. The index only exists on recentchanges replicas and the masters (...
[06:31:28] DBA, Operations, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (Marostegui) >>! In T100501#3371873, @jcrespo wrote: > Blocked on full stretch migration. So only pending labsdb1004, labsdb1005, dbstore1002 and the parsercache?
[06:48:48] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui) p: Low>Normal I have altered testwiki on db2033 for now.
[06:56:21] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[07:15:32] DBA: db1118 crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (jcrespo) p: Triage>Low
[07:17:48] DBA: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (jcrespo) a: jcrespo I will do that, I'll see from where.
[07:34:34] DBA, Operations, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (jcrespo) > So only pending labsdb1004,labsdb1005, dbstore1002 and the parsercache For those, a non-system user is used, that is why https://gerrit.wikimedia.org/r/454291...
[07:53:31] no trolling intended at all, but I was reading PostgreSQL 11's release notes last weekend
[07:53:35] some cool stuff there!
[07:53:59] e.g. JIT compilation of certain expressions using LLVM caught my eye
[07:55:38] sure, because interpreting SELECT * FROM table where id=X (aka 99.999% of our queries) is where the performance loss is :-)
[07:56:12] while the other 0.001% is one of the 300 million possible ways of querying recentchanges
[07:57:34] fix the secondary index update overhead, and you get my attention
[08:05:16] I wasn't mentioning it in the context of wikimedia, just as a general "cool tech" comment
[08:05:59] well, postgres being cool is obvious
[08:06:08] to me, I mean
[09:04:26] DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 (Marostegui) This bit us today when changing some indexes, so I am going to start deleting these tables in core. They were renamed at T153638#3097450 and nothing broke, and nothing has broken since codfw is active....
[09:17:38] DBA: db1118 crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek) a: Banyek
[09:18:50] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (jcrespo)
[09:19:30] DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 (Marostegui) From s3 the following wikis will need to get the table deleted: {P7560} From s7: `kowiki` The rest of sections are clean (apart from the places where it is really used as described in T153638#4592556
[09:19:32] e.g. you can try starting replication just to check if innodb replication consistency works
[09:19:43] that is for banyek^
[09:19:54] :)
[09:25:54] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek) journalctl -u mysql shows errors: Sep 14 15:13:50 db1118 mysqld[15915]: 2018-09-14T15:13:50.377830Z 41 [Warning] [MY-010897] [Server] Storing MySQL user name or password information in the master info...
[09:35:29] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek) Open>Resolved These errors look like the replication is not compatible: ``` Sep 18 09:29:37 db1118 mysqld[18383]: 2018-09-18T09:29:37.803219Z 23452 [Warning] [MY-010584] [Repl] Slave I/O for...
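The JIT compilation mentioned above can be tried on any PostgreSQL 11 instance; the following is only a minimal, hypothetical sketch (the table and thresholds are illustrative, not anything from production) showing how to force the JIT to fire and confirm it in the plan, since it is off by default in 11.

```
-- Hypothetical PostgreSQL 11 session: jit is off by default in 11,
-- so enable it and lower the cost threshold to force compilation.
SET jit = on;
SET jit_above_cost = 0;   -- default is 100000; 0 forces JIT for any plan

-- An expression-heavy aggregate over a scan is the kind of query JIT helps;
-- a point lookup like SELECT * FROM t WHERE id = X gains essentially nothing.
EXPLAIN (ANALYZE)
SELECT sum(rc_id * 2)
FROM recentchanges                     -- illustrative table name
WHERE rc_timestamp > '20180901000000';
-- When JIT was used, the plan footer includes a "JIT:" section with the
-- number of compiled functions, the options used and compilation timings.
```

Which is exactly the point made in the chat: the win is in expression evaluation during large scans and aggregations, not in the simple indexed lookups that make up most of the traffic.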
[09:35:51] ^ I didn't want to close it
[09:35:59] You can reopen it
[09:36:13] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek) Resolved>Open
[09:39:23] DBA: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594 (Banyek) 3 minutes later it crashed with sig11 ``` Sep 18 09:32:59 db1118 systemd[1]: mysql.service: Main process exited, code=killed, status=11/SEGV ```
[09:44:15] DBA: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 (Marostegui)
[09:44:18] Blocked-on-schema-change, DBA, Schema-change: Rename two indexes in the Echo extension - https://phabricator.wikimedia.org/T51593 (Marostegui)
[10:19:48] connection errors seem clean now for 20 minutes
[10:19:55] now trying the s2 depool
[10:20:34] great!
[10:36:43] we are a bit low on space on dbstore2002
[10:37:17] uncompressed sqldata barely fits
[10:37:46] only 300 GB will be left, enough to run the compression
[10:45:52] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (jcrespo) Cloning is ongoing from db2041 to dbstore2002 /srv/sqldata, will rename later to /srv/sqldata.s2 and restart replication there quickly. After the backups run, some compression process will have to be rerun...
[11:03:36] DBA, Operations, ops-codfw: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (jcrespo) a: jcrespo>None
[11:03:57] DBA, Patch-For-Review: Test MySQL 8.0 with production data and evaluate its fit for WMF databases - https://phabricator.wikimedia.org/T193226 (jcrespo) a: jcrespo>None
[11:06:42] DBA, Patch-For-Review: Finish eqiad metadata database backup setup (s1-s8, x1) - https://phabricator.wikimedia.org/T201392 (jcrespo) a: jcrespo>None
[11:07:07] DBA, Epic, Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[11:07:10] DBA, Patch-For-Review: Finish eqiad metadata database backup setup (s1-s8, x1) - https://phabricator.wikimedia.org/T201392 (jcrespo) Open>stalled Blocked on new backup hardware setup.
[11:09:16] DBA, Data-Services, MediaWiki-General-or-Unknown, Security: Make (redacted) log_search table available on ToolLabs - https://phabricator.wikimedia.org/T85756 (jcrespo) a: jcrespo>None
[11:09:54] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (jcrespo) a: jcrespo
[11:20:14] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (jcrespo) List of tables to archive: ``` root@labsdb1010:~$ mysql -BN --skip-ssl -e "select table_schema, table_name FROM information_schema.tab...
[12:55:29] DBA, MediaWiki-Special-pages, Performance, Regression: Slow access to Special:Contributions on mediawiki.org - https://phabricator.wikimedia.org/T204669 (hashar) From profiling (with https://www.mediawiki.org/wiki/Special:Contributions/Andrew0005?forceprofile=1 ) that comes from `ContribsPager::r...
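The "compression" that will have to be rerun on dbstore2002 after the s2 reimport refers to re-compressing the tables on disk; one common way to do that on these MariaDB hosts is the InnoDB compressed row format. The sketch below is only illustrative (placeholder table name and block size, not the actual script used for the backup sources):

```
-- Minimal sketch: convert a large table back to InnoDB compressed row
-- format to reclaim space after recloning. Requires innodb_file_per_table
-- and, on older MariaDB versions, innodb_file_format=Barracuda.
ALTER TABLE wikidb.revision            -- placeholder schema/table
  ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;                    -- 8 KB pages; 4 compresses more, but is slower

-- Check which InnoDB tables are still uncompressed, biggest first:
SELECT table_schema, table_name,
       ROUND(data_length/1024/1024/1024, 1) AS data_gb, row_format
FROM information_schema.tables
WHERE engine = 'InnoDB' AND row_format <> 'Compressed'
ORDER BY data_length DESC
LIMIT 20;
```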
[13:08:26] DBA, MediaWiki-Special-pages, Performance, Regression: Slow access to Special:Contributions on mediawiki.org - https://phabricator.wikimedia.org/T204669 (hashar) I have quickly poked the #DBA to check whether something was going on, and the servers look all fine. From our Logstash, a query for...
[13:16:59] this might be interesting for the dbproxy hosts? https://tracker.debian.org/news/988151/accepted-prometheus-haproxy-exporter-090-1-source-amd64-into-unstable-unstable/
[13:19:32] moritzm: thanks, it is indeed very interesting
[13:19:42] also for haproxy on thumbor
[13:20:06] we were thinking of doing it ourselves, much less work now
[13:20:17] CC godog ^
[13:21:07] indeed, gilles also has been working on backporting haproxy-exporter to jessie, successfully
[13:21:33] I thought I cc'd you on the task, clearly not
[13:21:40] T204266
[13:21:41] T204266: Backport prometheus haproxy exporter for Jessie - https://phabricator.wikimedia.org/T204266
[13:25:40] DBA, MediaWiki-Special-pages, Performance, Regression: Slow access to Special:Contributions on mediawiki.org - https://phabricator.wikimedia.org/T204669 (Anomie) a: Anomie The specific query in T204669#4593536 is due to the enabling of actor table WRITE_BOTH mode on mediawiki.org (07a6db7), so...
[13:36:49] marostegui: banyek it would be great to test if what motivated the reimage, the bad performance of ApiQueryRevisions::run, is now fixed on db1114
[13:39:56] jynus: ok
[13:42:47] banyek: will fill you on that at another time
[13:42:59] *in
[13:55:08] our architecture has some positive things - the dbstore2002 crash only affected s2, while in the past it would have affected all sections (due to multisource), and the recovery would have been more painful
[13:55:17] so this was a win
[14:06:36] DBA, Patch-For-Review: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 (Banyek) The recloning finished, but I am not closing the ticket: it needs to be checked if the plan is working well again.
[14:12:31] DBA, Patch-For-Review: dbstore2002 s2 crashed - https://phabricator.wikimedia.org/T204593 (jcrespo) p: High>Normal a: jcrespo>None dbstore2002:s2 is up and running (although still catching up on replication), enough for the backups to run in codfw unaltered. Grants have been altered to allow...
[14:16:56] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (Reedy) >>! In T174802#4593102, @jcrespo wrote: > @Reedy legalteamwiki that you mention on your original task will not be on this list, as these a...
[14:18:55] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (jcrespo) > The tables cannot be dropped from other wikis just yet, as the extension is still enabled (but mostly disabled) on other wikis. Thank...
[14:19:57] banyek: can you maybe run some of the bad queries on db1114 to see if the plan is at least the same now as on db1119?
[14:22:05] I could, but you have to show me where I can find those bad queries
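For context on the Special:Contributions regression discussed above: with actor WRITE_BOTH mode the contributions query gains an extra join, and the optimizer can pick a poor plan. The statement below is only a hypothetical reconstruction of the query shape (table and column names as of the 2018 actor migration; it is not the actual query from T204669#4593536, which is not quoted here), useful for comparing plans with EXPLAIN:

```
-- Hypothetical shape of a contributions query during the actor migration
-- (WRITE_BOTH): the user is resolved through actor and revision_actor_temp.
EXPLAIN
SELECT r.rev_id, r.rev_timestamp, r.rev_page
FROM revision r
JOIN revision_actor_temp rat ON rat.revactor_rev = r.rev_id
JOIN actor a ON a.actor_id = rat.revactor_actor
WHERE a.actor_name = 'Andrew0005'          -- the user being listed
ORDER BY r.rev_timestamp DESC
LIMIT 51;
-- A good plan starts from actor and walks a timestamp-ordered index;
-- a bad one scans revision and filesorts, which is what shows up as a
-- multi-second Special:Contributions load.
```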
[14:22:26] I'm in a meeting, will look for the ticket later
[14:24:39] banyek: I am the person to talk to
[14:24:52] jynus: ohhai
[14:25:00] I need to find those in the logs or maybe you can help me do that
[14:25:19] we need to go to a period pre-dc switch and it is the top failing query on db1114
[14:25:36] while it worked nicely on all of codfw and other eqiad hosts
[14:25:46] even after doing analyze
[14:25:53] hm
[14:25:59] I am checking tendril
[14:26:09] no, use kibana
[14:26:20] tendril is slow for older results
[14:27:19] ApiQueryRevisions::run ?
[14:27:59] https://logstash.wikimedia.org/goto/fa921005674305a764a9fa3ddeb104ab
[14:28:04] should be around here^
[14:28:31] there are a few occurrences
[14:28:52] they worked on db1119 but not on db1114 despite maintenance, so we just decided to clone it
[14:29:19] hopefully it still works on 9 but also now on 4
[14:29:34] works == takes <0.05 seconds
[14:29:56] with that you may work alone?
[14:32:28] I think so, yes
[14:33:00] ask anything you need, ok?
[14:33:08] that is the only thing missing to close that ticket
[14:36:18] banyek: did you test starting replication on db1118? Can I or are you working on it?
[14:36:51] the mysql8 host?
[14:36:54] I did
[14:36:55] yes
[14:36:57] it crashes
[14:37:02] oh, really
[14:37:06] I didn't know that
[14:37:13] crash as in restarts?
[14:37:16] in srv I created a directory named strace
[14:37:28] wow, so that is serious
[14:37:41] we could send a bug report
[14:37:43] no, it starts, runs for a while, complaining about the format of the binlog, and then 3 minutes later crashes =:-O
[14:37:49] :-(
[14:37:56] either to mariadb or mysql or both
[14:38:12] so there's a strace dir, I ran strace -p -ff -o there
[14:38:29] I can accept it stopping replication, but not crashing the server
[14:38:36] :-(
[14:38:37] checked the straces but I didn't find anything
[14:38:46] maybe gdb would be better for debugging
[14:39:06] I didn't check graphs, etc. regarding this
[14:39:10] Oh, I misread T204594#4592730
[14:39:10] T204594: db1118 mysql process crashed (mysql 8.0 test host) - https://phabricator.wikimedia.org/T204594
[14:39:38] I thought you were saying it had crashed 3 minutes after the previous comment
[14:39:44] not that it crashed consistently
[14:39:57] that is bad news :-(
[14:40:03] I restarted it and it happened again
[14:40:20] if you check the logs with journalctl you'll see
[14:40:28] I believe you
[14:40:40] I tested it 2 or 3 times, and the 3 minutes + a few seconds were consistent
[14:40:48] we will have to talk about this at our strategy meeting to see what we do
[14:40:51] for me that is the weird part
[14:41:02] is it always the same transaction that crashes it? have you checked binlogs?
[14:41:27] marostegui: sadly no, I didn't get there so far
[14:41:41] could be a good thing to check, to see if it is always the same write
[14:42:00] marostegui: it shouldn't be
[14:42:16] otherwise it would crash immediately
[14:42:25] but does replication advance?
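The plan check asked for here ("see if the plan is at least the same now as on db1119") boils down to running EXPLAIN for the same statement on both hosts and diffing the output. The query below is only a stand-in; the real slow ApiQueryRevisions::run statement lives in the logstash sample and in P7562 and is not reproduced here:

```
-- Run on db1114 and db1119 (plus a known-good host such as db1089)
-- and compare key / rows / Extra. A plan flip usually shows up as a
-- different chosen key or a sudden jump in examined rows.
EXPLAIN
SELECT /* stand-in for the ApiQueryRevisions::run query */
       rev_id, rev_timestamp
FROM revision
WHERE rev_page = 12345                -- placeholder page id
ORDER BY rev_timestamp DESC
LIMIT 500;

-- If plans differ only because of stale statistics, refreshing them is
-- cheap; per the chat above, this had already been tried without success:
ANALYZE TABLE revision;
```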
[14:42:54] based on what banyek says I would assume so, but he has to tell
[14:43:04] unless it keeps trying for 3 minutes
[14:43:13] or
[14:43:22] it could be an alter that gets reverted
[14:43:28] the seconds_behind_master is constantly counting up, but I didn't look at the log positions
[14:43:40] and that would explain why it takes so long to activate
[14:43:50] needs more study
[14:43:58] jynus: remember I ran a big alter on the eqiad master in the morning and that host crashed in the evening, could be a coincidence or not, but worth checking
[14:43:59] then we can send bug reports
[14:44:06] marostegui: +1
[14:44:11] I can spend time on that tomorrow
[14:44:14] also to check if there is a pattern in the last transactions before the server crashes
[14:44:37] I am acking the replication alerts for that host
[14:44:37] banyek: yeah, I would suggest we get the db1114 plan checked before investigating db1118
[14:44:54] db1118 is lower importance
[14:44:56] indeed
[14:44:59] I checked the plan, it is the same now
[14:45:07] \o\ |o| /o/
[14:45:10] ban looks good
[14:45:17] I started to run those queries to see what happens
[14:45:25] they're running for minutes now
[14:45:34] on both?
[14:45:37] then that is bad news
[14:45:39] that buffer pool is freezing
[14:45:43] buffer pool is probably cold
[14:45:45] (I mean cold)
[14:45:59] no, they should take milliseconds even if cold
[14:46:11] especially on a second run
[14:46:23] banyek: kill them if they last for more than 60 seconds
[14:46:28] The first run hasn't finished yet
[14:46:32] we may have made the problem worse
[14:46:32] :O
[14:46:54] Maybe run analyze table?
[14:46:59] no, that didn't work
[14:47:03] I did that the first time
[14:47:07] plus reboot and upgrade
[14:47:22] I would compare them to other non-touched hosts
[14:47:26] 654 seconds and in Sending data state
[14:47:34] banyek: kill the query, please
[14:47:35] :O
[14:47:46] we don't need it to finish to see it is bad
[14:48:02] queries are killed after 60 seconds if run by the application
[14:48:10] more than that is useless
[14:49:05] I killed it
[14:49:11] Yeah, it takes 0.0 on db1089 (a non-touched host)
[14:49:26] I wanted to see how long it ran
[14:49:36] don't
[14:49:41] not on production
[14:50:00] I have enough with developers doing that
[14:50:14] for us, >60 seconds is infinite
[14:50:44] it will bring down production equally
[14:50:52] "There's no endless loop, you will kill it eventually" as one of my old teachers said
[14:51:34] db1105 that was upgraded today to 10.1.36 and also restarted…4 seconds to run the query
[14:53:26] that would be more reasonable
[14:53:42] nothing is working today :-(
[14:54:09] this may need an index hint fix
[14:55:14] I am afraid so too, especially if for some reason now db1119 is also misbehaving without any real reason to do so, as its data wasn't touched (it got mysql and kernel upgraded though)
[14:55:18] mmm, it takes 0.04 sec for me on 1119
[14:55:42] not for me
[14:55:46] and 0.06 on 1114
[14:55:55] schroedingbug
[14:55:58] which queries are you trying?
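The two open questions about db1118 above - whether replication actually advances before the crash, and whether it always dies on the same write - can be answered with a check like the following on the replica after each restart. This is only a sketch; file names and positions are placeholders:

```
-- 1) Is the applier moving? Compare these values a minute apart:
--    Relay_Master_Log_File / Exec_Master_Log_Pos advancing means the SQL
--    thread is applying events, not just the I/O thread downloading them.
SHOW SLAVE STATUS\G

-- 2) What was being applied right before the SIGSEGV? Inspect the last
--    relay log events around the recorded position (placeholder file/pos):
SHOW RELAYLOG EVENTS IN 'db1118-relay-bin.000123' FROM 4 LIMIT 50;

-- 3) Cross-check on the master's binlog at Exec_Master_Log_Pos to see
--    whether the same statement or transaction precedes every crash:
SHOW BINLOG EVENTS IN 'master-bin.000456' FROM 4 LIMIT 50;
```

If the same event type (for example the DDL from the morning's alter) shows up just before each crash, that is the piece of evidence worth attaching to an upstream bug report.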
[14:56:06] jynus: https://logstash.wikimedia.org/app/kibana#/dashboard/DBQuery?_g=h@114715d&_a=h@bd32fd0 first one
[14:56:20] heisenbug, sorry
[14:56:28] https://phabricator.wikimedia.org/P7562
[14:56:33] me ^
[14:56:51] which used to be slow
[14:56:54] I checked this: https://www.dropbox.com/s/oaheiqjg4ndubxv/Screenshot%202018-09-18%2016.56.43.png?dl=0
[14:57:01] so we may have actually fixed something
[14:57:02] jynus: just posted mine
[14:57:22] but not all of them, there may be some that were slow and will continue to be slow
[14:57:47] your query runs fast, mine doesn't on db1114 or db1119 but it does on db1089
[14:57:50] I just expect some to be fixed, others may be broken everywhere
[14:57:57] mmmh :-(
[14:57:58] I'll leave you guys soon here, I have an onboarding chat
[14:58:02] at 5pm
[14:58:19] we will need more testing
[14:58:25] with a larger query sample
[14:58:27] yep, this is important
[14:58:42] as this can potentially hit all the servers when we upgrade them :|
[14:58:44] maybe some change depending on the data status
[14:58:45] like db1119
[14:58:50] too
[14:59:14] plus queries changing, which is already a thing
[14:59:34] yeah, but db1105 was also upgraded today and there my query runs fast
[15:00:55] maybe we should investigate with pt-playback
[15:00:59] now gtg
[15:01:04] see you tomorrow
[15:04:21] I think I am going to repool db2041 and call it a day
[15:04:44] jynus: sounds good! take some rest
[15:09:22] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) @Smalyshev have you noticed any improvements since the above comment was done, and the index is gone from ever...
[15:32:30] addshore: ping
[15:35:06] pong
[15:35:42] addshore: Can you check https://phabricator.wikimedia.org/T204729 to see if you have any idea about what this could be?
[15:35:55] I have been following along and nothing has sprung to mind
[15:36:01] :_(
[15:36:03] but I can have a deeper look this evening :)
[15:36:03] thanks
[15:36:15] If you could, that'd be great!
[15:36:17] thanks
[15:36:20] oh wait
[15:36:21] not
[15:36:27] that's a totally different ticket...
[15:36:36] yes, I will have a look at that right now ! :)
[15:36:44] <3
[15:39:33] it looks like that method is only called by 1 place in core.. ApiQueryInfo
[15:39:41] I wonder if it is someone hitting the api hard? *looks*
[15:55:32] marostegui: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=now-24h&to=now
[15:55:48] banyek: that is the query killer
[15:55:55] :(
[15:56:09] hmm, there is also UnwatchedPageThreshold that can be altered on that query
[15:56:25] looks like it is currently 30 but not sure if that would speed the query up
[16:00:14] marostegui: I might have a UA and IP if we want to block the requests
[16:00:22] * addshore stares at the logs a bit more
[16:00:56] addshore: maybe worth posting the ip(s) on the task
[16:00:58] addshore: add them to the ticket
[16:01:01] doing
[16:01:20] the people that can do that and can approve that will not be us :-)
[16:12:41] backporting a fix now instead :)
[16:12:50] <3
[16:12:55] thanks
[17:35:04] DBA, JADE, Operations, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) See also {T204250}.
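The "query killer" spike visible on the db2088 graph above is the automated version of the manual kill done earlier in the day. As a minimal sketch (the thresholds and the KILL target are illustrative, not the actual production watchdog), the same check can be done by hand from the processlist:

```
-- Find client queries that have been running for more than 60 seconds
-- ("for us, >60 seconds is infinite"); system and replication threads
-- are excluded with an illustrative user list.
SELECT id, user, db, time, LEFT(info, 120) AS query
FROM information_schema.processlist
WHERE command = 'Query'
  AND time > 60
  AND user NOT IN ('system user', 'repl', 'root')
ORDER BY time DESC;

-- Then terminate only the statement, keeping the connection alive:
KILL QUERY 1234567;   -- placeholder thread id taken from the list above
```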
[18:13:44] DBA, Data-Services, Quarry: Cannot reliably get the EXPLAIN for a query on analytics wiki replica cluster - https://phabricator.wikimedia.org/T195836 (MusikAnimal) Also note https://tools.wmflabs.org/sql-optimizer is a thing, which uses some hacky workarounds. Some queries still can't be explained, t...
[18:45:04] DBA, Growth-Team, MediaWiki-Watchlist, Wikimedia-production-error: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (kostajh) This appears to happen pretty frequently (see for example [here](https://logstash.wikimedia....
[20:25:53] DBA, Operations, Wikidata, Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9edd787798129b66ae9d61ed90 there's definitely a...
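On the watchlist-deletion task above, the usual mitigation for a DELETE that exceeds the transaction write-time limit is to chunk it. The raw SQL below is only a hedged illustration of the idea; in MediaWiki itself this would be done in PHP with batching and replication-lag waits between rounds:

```
-- Illustrative batched delete of one user's large watchlist: each
-- statement stays small, so no single transaction holds row locks or
-- binlog writes long enough to hit the 4-second write limit.
DELETE FROM watchlist
WHERE wl_user = 12345        -- placeholder user id
LIMIT 1000;
-- Note: with statement-based replication a deterministic ORDER BY (or
-- deleting by explicit key ranges) would be needed to keep this safe.

-- Repeat, pausing between rounds, until no rows remain:
SELECT ROW_COUNT();          -- 0 means the previous batch deleted nothing
```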