[06:09:42] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4074913 (10Marostegui)
[06:09:45] 10DBA, 10Operations, 10ops-eqiad: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4074910 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now! ``` root@db1052:~# megacli -LDPDInfo -aAll | egrep -i "slot|error|failure count|s.m.a.r.t"...
[06:10:21] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190446#4074914 (10Marostegui) 05Open>03Resolved a:03Marostegui This was part of a controlled replacement of disks with predictive failure. It is all good now ``` root@db1052:~# megacli -LDPDInfo -aAll Adap...
[06:12:07] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4074917 (10Marostegui)
[06:12:27] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4074920 (10Marostegui)
[06:12:33] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#3573512 (10Marostegui) 05Open>03Resolved All finished
[06:38:30] there is a long transaction happening on labsdb1004 (slave) that's been running for 7h...
[06:39:02] (a write coming from the master)
[06:49:59] looks like it is related to: u2815__p`.`all_articles
[07:12:55] I have created this: https://phabricator.wikimedia.org/T190488
[07:14:06] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4074957 (10Marostegui)
[07:15:34] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10MW-1.31-release-notes (WMF-deploy-2018-03-06 (1.31.0-wmf.24)), and 2 others: Consider dropping the "wb_items_per_site.wb_ips_site_page" index - https://phabricator.wikimedia.org/T179793#4074970 (10Marostegui) Was this deployed?
[07:50:23] can I kill the screens m_check and compress_db1113 ?
[07:50:58] yep
[07:51:22] just did it
[07:51:31] cool
[07:58:56] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 7 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4075020 (10jcrespo) From my point of view, the problem has gone: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?panelI...
[08:09:51] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 7 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4075026 (10jcrespo) I would even dare to say the baseline is lower: https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1&...
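A minimal sketch (not from the log) of how a long-running write transaction like the one reported on labsdb1004 can be located on a MariaDB replica; the one-hour cutoff is only an illustration:

```sql
-- List InnoDB transactions open for more than an hour, with the thread and
-- statement (if any) currently attached to them.
SELECT trx.trx_id,
       trx.trx_started,
       TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) AS seconds_open,
       trx.trx_rows_modified,
       p.user, p.host, p.db, p.info AS current_statement
FROM information_schema.innodb_trx AS trx
LEFT JOIN information_schema.processlist AS p
       ON p.id = trx.trx_mysql_thread_id
WHERE trx.trx_started < NOW() - INTERVAL 1 HOUR
ORDER BY trx.trx_started;
```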
[08:12:28] * AaronSchulz ogles at https://mariadb.com/kb/en/library/aborting-statements/
[08:14:23] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Query_Limits
[08:14:34] we don't yet have 10.1 everywhere on production
[08:15:14] mysql implementations are quite different
[08:16:44] https://gerrit.wikimedia.org/r/#/c/418593/2/wmfmariadbpy/WMFMariaDB.py
[08:19:53] relevant tickets: https://phabricator.wikimedia.org/T160984
[08:20:06] https://phabricator.wikimedia.org/T149421
[08:35:47] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10MW-1.31-release-notes (WMF-deploy-2018-03-06 (1.31.0-wmf.24)), and 2 others: Consider dropping the "wb_items_per_site.wb_ips_site_page" index - https://phabricator.wikimedia.org/T179793#4075102 (10Marostegui) 05Open>03Resolved >>! In T1797...
[10:44:58] Amir1: I have updated tin to HEAD, but I have not deployed anything
[10:45:15] what's up
[10:45:19] there is an undeployed patch Lucas
[10:45:21] *from
[10:45:28] I didn't see him online
[10:45:40] I did SWAT his patch last night
[10:45:41] https://gerrit.wikimedia.org/r/#/c/421333/
[10:45:47] and I'm pretty sure I deployed it
[10:45:58] well, it was behind on tin
[10:46:11] I can check if someone rollback tin only
[10:46:16] "23:39 ladsgroup@tin: Synchronized wmf-config/Wikibase-production.php: Disable reading wb_terms search fields on wikidata (T189777) (duration: 00m 58s)"
[10:46:16] T189777: Disable reading from term_search_key from wb_terms table in wikidata - https://phabricator.wikimedia.org/T189777
[10:46:19] *rolled back
[10:46:50] hmm, it can be that I forgot to do a rebase, which happens too often
[10:46:58] definitely not deployed
[10:47:00] can confirm
[10:47:10] let me deploy mine
[10:47:27] and I will let you handle that as you wish (revert, deploy, etc)
[10:47:29] SAL says I sync'ed the file, which means I forgot to rebase, sorry
[10:47:38] oh, no problem to me
[10:47:42] I'll just deploy it
[10:47:42] this was a heads up
[10:47:48] Thank you
[10:48:04] sometimes they are harmless, as I guess this one
[10:48:11] but sometimes it can lead to problems
[11:06:01] 10DBA, 10Data-Services, 10Toolforge: Possibly a big update going to: u2815__p`.`all_articles - https://phabricator.wikimedia.org/T190488#4075374 (10jcrespo) a:03jcrespo
[11:14:46] 10DBA, 10Data-Services, 10Toolforge, 10Patch-For-Review: Possibly a big update going to: u2815__p`.`all_articles - https://phabricator.wikimedia.org/T190488#4075392 (10jcrespo) Things are getting better now: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prome...
[11:22:52] 10DBA, 10Data-Services, 10Toolforge, 10Patch-For-Review: Possibly a big update going to: u2815__p`.`all_articles - https://phabricator.wikimedia.org/T190488#4075401 (10Dispenser) >>! In T190488#4075372, @jcrespo wrote: > That would be cool, thank you! I can remove the filter and reimport the table to the (...
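The "aborting statements" feature being linked above (MariaDB 10.1+ statement timeouts, as used for the Toolforge query limits) can be sketched like this; the limits and the example query are illustrative, not the actual Toolforge or production settings:

```sql
-- Abort any statement in this session that runs longer than 300 seconds.
SET SESSION max_statement_time = 300;

-- Or cap a single statement without changing the session default (10.1+ syntax).
SET STATEMENT max_statement_time = 10 FOR
  SELECT COUNT(*) FROM revision;
```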
[14:15:42] 10DBA, 10Data-Services, 10Toolforge, 10Patch-For-Review: Possibly a big update going to: u2815__p`.`all_articles - https://phabricator.wikimedia.org/T190488#4075877 (10jcrespo) 05Open>03Resolved
[14:16:11] 10DBA, 10Data-Services, 10Toolforge, 10Patch-For-Review: Possibly a big update going to: u2815__p`.`all_articles - https://phabricator.wikimedia.org/T190488#4075878 (10Marostegui) Thanks for handling this
[14:40:10] 10DBA, 10Collaboration-Team-Triage, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762#3015629 (10Marostegui) This table only exists on s1 (enwiki) 352711 rows s3 (testwiki) 162 rows s3 (test2wiki) 77 rows
[14:41:08] 10DBA, 10Collaboration-Team-Triage, 10MediaWiki-extensions-PageCuration, 10Schema-change: Drop ptrl_comment in production - https://phabricator.wikimedia.org/T157762#4075926 (10Marostegui)
[14:44:54] 10DBA: Drop flaggedrevs tables from mediawikiwiki - https://phabricator.wikimedia.org/T186865#3957724 (10Marostegui) @demon all these?: ``` root@db1075[mediawikiwiki]> show tables like 'flaggedrevs%'; +----------------------------------------+ | Tables_in_mediawikiwiki (flaggedrevs%) | +-------------------------...
[15:11:25] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-03-13 (1.31.0-wmf.25)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4075999 (10EddieGP) 05Open>03Resolved Thanks @Dzahn! With the cron...
[15:46:26] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076112 (10Marostegui) Should we consider this solved? It has not happened for a month: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=10&fullscreen&orgId=1&from=now-30d&to=now
[15:53:42] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076146 (10jcrespo) Actually, this got solved for s2 this morning, but now it is happening on s1: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=10&fullscreen&orgId=1&from=1521744688622&to=152...
[15:56:09] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076162 (10Marostegui) I am not sure I am understanding your comment, on that graph (and the one I posted) it is mostly 0 errors, no?
[15:56:34] ^check the unit- the unit is errors/second
[15:57:22] 0.350 errors is 10 errors per minute- not alarming, but why do they happen so regularly, and stop on one and start on the next?
[15:57:33] Ah!
[15:57:36] It is per second!
[15:57:37] Right right
[15:57:46] more like 21 errors per minute
[15:58:01] which is ok as a spike, but happens so regularly
[15:58:15] it is probably some misconfiguration or something that can be easily avoided
[15:59:02] can you think of something you changed at 8:33? could be unrelated, but may give leads to why it happened
[15:59:09] I will also review my logs
[15:59:20] today at 8:33?
[15:59:44] and it started on s1 at 8:47
[16:00:12] I haven't changed anything today, just cleaning up tickets really
[16:00:15] could be something related to monitoring
[16:00:17] haven't made any change to core today
[16:00:36] yes, I am just trying to figure out a reason
[16:00:55] yeah, I am trying to think...
[16:01:08] did you change for example one prometheus host that was misconfigured on s1 ?
[16:01:10] Regarding monitoring, we just moved db1067 from s2 to s1 and that was later than 8:33
[16:01:15] ah
[16:01:20] but that is a good lead
[16:01:30] https://gerrit.wikimedia.org/r/#/c/421477/
[16:01:31] could be some misconfiguration of prometheus or something
[16:01:32] 9:23
[16:01:45] 8:23 UTC
[16:01:48] :-)
[16:01:55] almost surely related
[16:02:04] something like a bad prometheus config or something
[16:02:08] maybe we can depool that host
[16:02:10] it is an old one
[16:02:17] not sure it is mediawiki
[16:02:29] I would try prometheus first
[16:02:33] stopping the exporter
[16:02:44] we can try to stop the exporter of that host
[16:02:45] XD
[16:03:50] but it looks related to that host, it would be too much of a coincidence
[16:03:53] from s2 to s1
[16:07:51] I was following that for months, and you probably solved the mystery
[16:09:40] Should we try to stop puppet and stop the exporter for the weekend?
[16:10:56] let me try to figure out the source, or at least the kind of errors
[16:11:36] I think it is access_denied_errors
[16:12:03] on logstash I cannot find errors for that IP
[16:12:18] probably not mediawiki
[16:12:26] prometheus or icinga or something else
[16:12:29] yeah, wanted to double check
[16:13:07] SHOW GLOBAL STATUS like 'access_denied_errors';
[16:13:26] is growing at 0.35/s or maybe a bit less
[16:14:03] Yeah and for db1089 (s1 as well), it only has 2 errors
[16:14:31] so something is getting access denied
[16:15:01] did you stop the exporter?
[16:15:05] nope
[16:15:08] I am comparing grants
[16:16:01] there are some weird grants
[16:16:07] could it be tendril?
[16:16:21] it used to have some slow log functionality
[16:16:31] or it could be something else
[16:17:30] that host doesn't have grants for 'tendril'@'10.%'
[16:17:45] not tendril, watchdog
[16:18:24] No, I am talking about tendril@10.% , that user isn't on db1067, but it is on db1089
[16:18:33] not saying it is the cause
[16:18:39] just saying that that grant is different
[16:19:00] the rest look the same
[16:20:19] the root password hash is different
[16:20:44] I know what it is
[16:20:50] but the password is the same, I guess one is updated to the new password format
[16:21:08] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4076225 (10Cmjohnson)
[16:21:12] set log_warnings to 2, tailed the error log
[16:21:20] check the error log, it is there
[16:21:24] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4076229 (10Cmjohnson)
[16:21:26] :-)
[16:21:28] ha!
[16:21:28] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3978426 (10Cmjohnson) 05Open>03Resolved
[16:21:30] I see it now!
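A minimal sketch (not from the log) of the two checks described above: sampling the Access_denied_errors counter to estimate how fast it grows, and temporarily raising log_warnings so the denied connection attempts show up in the error log. The 60-second interval is arbitrary:

```sql
-- Sample the access-denied counter twice to estimate the error rate.
SHOW GLOBAL STATUS LIKE 'Access_denied_errors';
SELECT SLEEP(60);
SHOW GLOBAL STATUS LIKE 'Access_denied_errors';
-- (second value - first value) / 60 gives errors per second;
-- ~0.35/s here, i.e. roughly 21 failed connection attempts per minute.

-- At log_warnings >= 2, MariaDB writes access-denied connection attempts
-- (user and client host) to the error log; restore the previous setting
-- afterwards (0 in this case, per the log).
SET GLOBAL log_warnings = 2;
-- ... tail the error log while the failures repeat ...
SET GLOBAL log_warnings = 0;
```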
[16:22:03] it is tendril, but probably that functionality I never knew how to use
[16:22:20] related to the crons I didn't know what they were for
[16:22:48] but that user isn't on db1089
[16:22:50] let me set the warnings back to 0
[16:23:00] this needs deeper investigation
[16:23:38] but knowing it is not critical, I am happier
[16:23:55] but that user doesn't exist on db1089 either eh
[16:24:06] so it must be something not up to date or something
[16:24:09] yeah, it is a one-time specific thing
[16:24:18] I think for slow log monitoring
[16:24:28] well honestly, I never knew how it worked
[16:24:36] and probably no longer needed due to performance_schema
[16:25:07] will comment on ticket, maybe followup or kill it next week
[16:25:20] yeah, I would just kill it anyways
[16:25:32] tendril.cnf doesn't have any user = monitor either
[16:25:35] so when I ask you "did you do something" it is not to blame you, it is because you may have solved the mystery
[16:25:40] :-)
[16:25:44] and you certainly did
[16:25:48] haha by chance
[16:25:54] it is connecting every 10 seconds...
[16:27:13] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076258 (10jcrespo) We followed up on IRC, but we identified this is a terbium process trying to connect to db1067 and not having the right grants. We are not even sure if that should be running, so we may look at...
[16:27:23] aka 0.35 times per second
[16:27:30] :-)
[16:27:47] give or take
[16:31:39] I think I found what it is
[16:31:43] with a traffic capture
[16:32:02] I was looking at puppet
[16:32:05] program_name.proxysql_monitor
[16:32:06] :)
[16:32:14] ah
[16:32:18] so not tendril
[16:32:20] the proxy
[16:32:23] yeeep
[16:32:29] looks so, from tcpdump
[16:33:41] you can do mysql --socket=/run/proxysql/proxysql_admin.sock on terbium to check the config
[16:34:14] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076273 (10Marostegui) So by doing a traffic capture, I have seen proxysql trying to connect from terbium to db1067 every 10 seconds, which matches what we saw on the logs: ``` libmariadb._pid.1572._client_version...
[16:35:16] mysql-monitor_username
[16:35:33] | mysql-monitor_username | monitor |
[16:35:36] haah
[16:35:38] yeah
[16:35:49] I guess I can add it?
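A hedged sketch of the proxysql admin check mentioned above; it assumes the standard proxysql admin schema and the socket path given in the log:

```sql
-- Connect with: mysql --socket=/run/proxysql/proxysql_admin.sock
-- and inspect which credentials the proxy uses to health-check its backends.
SELECT variable_name, variable_value
FROM global_variables
WHERE variable_name LIKE 'mysql-monitor%';
-- In this case mysql-monitor_username = 'monitor', an account that did not
-- exist on db1067, hence the access-denied attempt every 10 seconds.

-- The backends the monitor probes can be listed the same way:
SELECT hostgroup_id, hostname, port, status FROM mysql_servers;
```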
[16:35:58] yeah +1
[16:36:11] then we can confirm it is fixed by enabling log warnings again
[16:38:30] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4076284 (10Cmjohnson)
[16:38:35] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4076286 (10Cmjohnson)
[16:38:39] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4076287 (10Cmjohnson)
[16:38:46] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10Cmjohnson) 05Open>03Resolved
[16:41:02] I think it stopped now
[16:41:06] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4076293 (10Marostegui)
[16:41:09] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4076292 (10Marostegui)
[16:42:16] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3189449 (10Marostegui)
[16:42:18] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4076294 (10Marostegui)
[16:42:37] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3200902 (10Marostegui)
[16:42:39] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#4076296 (10Marostegui)
[16:43:19] not seeing the error again on the log?
[16:43:23] nope
[16:43:28] fixed!
[16:43:33] \o\ |o| /o/
[16:43:37] the proxy stuff, as you understand
[16:43:43] not in the finest of the states
[16:43:53] if all the issues were those...!
[16:45:32] 10DBA, 10HHVM: High database error rates on s2 and s3 - https://phabricator.wikimedia.org/T185646#4076304 (10jcrespo) 05Open>03Resolved a:03Marostegui Finally fixed, it was a badly configured proxysql (not on production).
[16:46:27] After that, I think it is the perfect time to say Hello to the weekend!
[16:54:38] for the holidays?
[16:54:52] yeah!
[16:54:58] have a nice week!
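A minimal sketch of the fix agreed on above ("I guess I can add it?" / "yeah +1"): creating the monitor account on the backend so the proxysql health checks stop being denied. The host pattern, password and exact privileges are placeholders, not the values actually deployed:

```sql
-- On the MariaDB backend (db1067): create the account proxysql monitors with.
-- Placeholder host pattern and password; adjust to the real monitor config.
CREATE USER 'monitor'@'10.%' IDENTIFIED BY 'REDACTED';
GRANT USAGE, REPLICATION CLIENT ON *.* TO 'monitor'@'10.%';

-- Afterwards, briefly raise log_warnings again (as done in the log) and
-- confirm no new access-denied entries appear in the error log.
```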
[16:55:09] you too!, I will see you then 2nd :)
[16:55:14] *the 2nd
[16:55:25] bye
[16:55:46] 10DBA, 10Wikimedia-Site-requests: Request to manually add "bot flag" to past edits by bot account - https://phabricator.wikimedia.org/T190538#4076326 (10Od1n)
[16:56:18] ^ I think that task has nothing to do with us
[17:26:47] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4076488 (10RobH) a:05RobH>03Cmjohnson
[17:27:10] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4076492 (10RobH) a:05RobH>03Cmjohnson
[17:27:16] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4076495 (10RobH) a:05RobH>03Cmjohnson
[20:28:47] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 6 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4077242 (10mobrovac) 05Open>03Resolved a:03Pchelolo I agree that all of the issues have been fixed, but to my understanding the sco...
[22:49:50] 10DBA, 10Phabricator, 10Release-Engineering-Team (Next): Switch phabricator production to codfw - https://phabricator.wikimedia.org/T164810#4077667 (10mmodell)