[00:46:05] we noticed this for other tools too, like wsexport. It used to be very unstable; turns out it was just web crawlers
[00:46:23] what happens, I think, is that the community will link to tools on-wiki (where web crawlers are rampant), and then the crawlers try to scrape the results of some tool. The problem of course is that the tool fires off long-running queries
[04:24:46] musikanimal: that's an interesting conversation to have with the cloud team indeed, but how can you detect whether it is a legit use of the tool or a crawler using the tool?
[04:28:09] User agents. Most of them are kind enough to have a UA you can block. I find that not all of them respect robots.txt
[04:28:55] Ah, I thought you meant to block those users at the MySQL level
[04:29:08] Oh no
[04:29:34] But indirectly they are affecting the performance and stability of the DB servers, is my belief
[04:30:24] A somewhat out-of-date list of bad bots blocked from XTools, see step #12: https://wikitech.wikimedia.org/w/index.php?title=Tool:XTools#Building_a_new_instance
[04:32:58] In a normal situation labs do get some load and they normally cope decently well with it, but when we have to depool one of them they do suffer indeed
[04:33:19] I think this is a conversation to have with bd808, as I am not really sure if blocking crawlers is something we have ever considered
[04:33:29] Maybe we can block them in certain situations (like this)
[04:35:35] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[04:36:04] the truly nasty crawlers use legit UAs from common web browsers
[04:36:19] Yeah, I was just about to create a phab task
[04:36:32] Yes, I see a lot of those
[04:37:17] the robots.txt for Toolforge is not very complete, for sure -- https://tools.wmflabs.org/robots.txt
[04:37:23] They're a pain for sure, but the legit crawlers usually have something I can block, and it's just a big game of whack-a-mole
[04:38:27] Yeah, well, the bots I'm blocking do not respect https://xtools.wmflabs.org/robots.txt
[04:40:21] I just wonder what the db usage would look like if we blocked all the crawlers and/or made robots.txt more restrictive
[04:40:30] musikanimal: make a task and we can talk/think about fancier blocking. I worry that, like you said, it's a game of whack-a-mole, and also that we will end up not having any good way to tell what it is achieving
[04:41:13] Will do!
[04:42:49] I guess I could imagine something really "fun" like having a big list of UAs and IPs that we make solve some kind of captcha (or make a constructive edit on a wiki!) before we let them past the nginx reverse proxy
[04:44:10] marostegui: is there an easy way for me (no root on the replica boxes) to see which tools have active db connections at any given time?
[04:44:53] bd808: which user do you use to connect to the DB?
[04:46:17] you'd need access to performance_schema and information_schema to check out the user stats
[04:48:16] marostegui: good question. I guess I generally use a random tool account from the ones I have the creds for when poking manually.
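
[Editor's aside: a minimal sketch of what that kind of per-user inspection could look like on a replica, assuming an account with the PROCESS privilege (or SELECT on performance_schema); since each tool connects with its own MariaDB user, grouping by user roughly maps to grouping by tool.]

    -- Count connections and the longest-running statement per database user.
    -- Without the PROCESS privilege only the account's own threads are visible,
    -- which is what the grant question above is getting at.
    SELECT user,
           COUNT(*)                AS connections,
           SUM(command != 'Sleep') AS running,
           MAX(time)               AS longest_running_seconds
    FROM information_schema.processlist
    GROUP BY user
    ORDER BY running DESC, connections DESC;

[This only shows current connections; historical per-account counters would come from the performance_schema summary tables instead.]
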
[04:49:30] maybe I should get some help setting up an account on the replica boxes that has some rights to poke around but not enough to break things horribly :)
[05:00:02] * bd808 fades into the mist until ~14:00Z
[05:00:36] bd808: have a good night
[05:17:56] I created https://phabricator.wikimedia.org/T226688
[05:18:21] thank you
[05:27:00] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:06:14] what is the status of labsdb compression?
[07:07:23] I have stopped it on 1011
[07:07:27] And started replication again
[07:07:34] Will finish it next week
[07:16:25] the amount of writes is saturating labsdb1011's transaction log
[07:16:30] that is a first for us
[07:17:10] Maybe I can start replication in batches
[07:17:16] Not all at once
[07:17:17] no, it is ok
[07:17:27] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&from=1561609042545&to=1561619842545
[07:17:35] it is just an unusual state
[07:18:02] yeah, but expected
[07:18:03] having ~4GB of queued writes
[07:18:52] I think IO is being saturated
[07:19:43] see also that the amount of writes on s5 is low; we should think of moving other wikis to it at some point
[07:20:01] I was thinking about doing it on the next DC swap
[07:20:14] Maybe s5 and s6 even
[07:22:46] Shall I create the task to start discussing it?
[07:22:49] Or possible candidates
[07:23:40] ok to me
[07:23:50] I was checking the backup stats already
[07:29:56] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:31:29] None reaches the 5GB compressed: https://phabricator.wikimedia.org/P8660
[07:32:42] nice
[07:32:51] so good candidates to be moved to either s5 or s6
[07:32:57] I will paste that on the task
[07:36:07] https://phabricator.wikimedia.org/P8661
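
[Editor's aside: the figures in the pastes above come from the backup metadata, but a rough live equivalent can be pulled from information_schema on a replica. A minimal sketch; the LIKE filter is only illustrative, the 5 GB cutoff is the one mentioned above, and on-disk sizes will differ from compressed-backup sizes.]

    -- Approximate per-wiki (per-schema) size to spot small move candidates.
    SELECT table_schema AS wiki,
           ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
    FROM information_schema.tables
    WHERE table_schema LIKE '%wiki'
    GROUP BY table_schema
    HAVING size_gb < 5
    ORDER BY size_gb DESC;
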
[07:37:45] what is backup_id = 1901 and 1897?
[07:39:53] the last s3 snapshots on eqiad and codfw
[07:40:03] it is the autoinc on the backups table
[07:40:04] ah, good
[07:40:20] I have one more stat coming
[07:45:11] these are unreliable: https://phabricator.wikimedia.org/P8662
[07:45:35] because there are a lot of ignored events due to the amount of different objects
[07:47:10] mediawikiwiki is clearly one of the candidates, along with loginwiki
[07:47:47] yeah, but I am not sure about both, given they are group 0
[07:48:06] but at least we have the stats
[07:48:42] yeah, we can discuss, we have plenty of time
[07:49:49] it is funny because you can see s5 has almost recovered while the others still have 4-5 days of delay
[07:50:00] haha yeah
[07:50:00] look
[07:50:03] Seconds_Behind_Master: 469802
[07:50:03] Seconds_Behind_Master: 425358
[07:50:04] Seconds_Behind_Master: 351056
[07:50:04] Seconds_Behind_Master: 431469
[07:50:04] Seconds_Behind_Master: 123231
[07:50:04] Seconds_Behind_Master: 404877
[07:50:04] Seconds_Behind_Master: 394730
[07:50:04] Seconds_Behind_Master: 432923
[07:51:16] parsercache hit rate is still lowish
[07:52:31] yeah, it normally takes like 5 days or so, as we saw in previous changes
[07:59:07] es2 and es3 are at 70% disk utilization
[08:01:21] but growth seems to have slowed down since 5/28
[08:02:19] reaching 80% in December 2019
[08:32:47] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[08:56:47] did you have time to give a quick look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/519203 ?
[09:07:41] I will try to do so today
[09:58:59] oh, today is Thursday?
[09:59:07] it is \o/
[09:59:14] I was convinced it was Wednesday
[09:59:20] surprise!
[09:59:25] I may arrive late to the meeting
[09:59:25] off by one :D
[10:08:21] oh crap
[11:15:25] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo)
[11:15:44] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo) p:Triage→Normal
[11:17:52] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo) @Anomie You mentioned some potential ES maintenance in the past. A server change would be a great opportunity to transition to a different mw configuration once we are o...
[12:40:23] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) @Marostegui, which wikis are affected? Only English Wikipedia? Do you need to display a banner too?
[12:41:27] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) >>! In T226358#5288891, @Trizek-WMF wrote: > @Marostegui, which wikis are affected? Only English Wikipedia? > Do you ne...
[12:42:14] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) Thank you! :)
[13:30:47] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) Banner set. It will be displayed starting at 05:00 UTC on July 3 on all wikis, ending at 06:20 UTC.
[13:31:30] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) Thank you!
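
[Editor's aside: for context on the Seconds_Behind_Master readings above and the catch-up status below — the labsdb hosts are multi-source replicas with one named connection per section, so per-section lag can be read in a single statement. A minimal sketch; the 's5' connection name is an assumption based on the section naming used in this conversation.]

    -- One row per replication connection on a multi-source MariaDB replica;
    -- relevant fields: Connection_name, Seconds_Behind_Master,
    -- Slave_IO_Running, Slave_SQL_Running.
    SHOW ALL SLAVES STATUS\G
    -- A single connection can also be queried directly:
    SHOW SLAVE 's5' STATUS\G
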
[13:59:30] everything has caught up on labsdb1011 except the big ones: s8 is 35k behind, commons 100k and enwiki 260k
[13:59:40] they have caught up half way in around 6 hours
[13:59:49] so that is looking good
[14:05:15] db2102 is running stretch with a bpo kernel? is that intentional or was that pulled in indirectly/unintentionally?
[14:39:06] I think db2102 was recently installed; unless manuel says something special about it, I suppose it was installed with the typical method
[14:42:46] sooo... I've a question for you for dbctl. When committing/restoring the configuration that mediawiki will see, what kind of SAL message do you expect/have in mind?
[14:42:56] jynus: actually it was me who installed the 4.19 kernel... we were debugging some server issues (with a stalled reboot or similar) in early April
[15:58:56] regarding SAL, I don't care, whatever happens now for the other new instances
[16:01:44] it's quite different given the amount of data that changes in the db configuration compared to a host action.
[16:02:16] honestly, I don't have enough information to say
[16:02:21] anyway we have a proposal (cdanis' merit if you like it ;) ) and we can improve it later
[16:02:34] if it is too verbose, we can send some stuff to #dba only
[16:03:01] there is one thing I just remembered we may need
[16:03:13] but it doesn't have to be in a first version
[16:05:29] actually, now that I see it, it can stay on codeconfig, so ignore me
[16:24:26] DBA, Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (Cmjohnson) Open→Resolved @Marostegui disk swapped but this server is out of warranty. I would suggest moving masters to new servers.
[16:36:56] DBA, Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (jcrespo) That's the plan. See: ` root@db1072:~$ megacli -PDList -aALL | grep rro Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 3 Media Error Count: 0 Other Error Coun...
[20:11:14] marostegui: jynus around?
[21:24:32] DBA, Core Platform Team, MediaWiki-Database, MediaWiki-General-or-Unknown: Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (Krinkle)
[21:25:36] DBA, Core Platform Team, MediaWiki-General-or-Unknown: Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (Krinkle) //(not currently an issue with the RDBMS library or a schema. Mass-triaging, feel free to revert if it seems I got it wrong.)//
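
[Editor's aside: on the query-planning task above, a hedged sketch of the kind of plan inspection it implies; the query below is a made-up illustration against the MediaWiki revision table, not one taken from the task.]

    -- Compare the optimizer's chosen plan on hosts running different versions:
    EXPLAIN
    SELECT rev_id
    FROM revision
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 50;
    -- Newer MariaDB releases also offer EXPLAIN FORMAT=JSON and ANALYZE SELECT,
    -- which executes the query and reports actual vs. estimated row counts.
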