[00:46:05] we noticed this for other tools too, like wsexport. It used to be very unstable; turns out it was just web crawlers
[00:46:23] what happens, I think, is that the community will link to tools on-wiki (where web crawlers are rampant), and then the crawlers try to scrape the results of some tool. The problem of course is that the tool fires off long-running queries
[04:24:46] musikanimal: that's an interesting conversation to have with the cloud team indeed, but how can you detect whether it is a legit use of the tool or a crawler using the tool?
[04:28:09] User agents. Most of them are kind enough to have a UA you can block. I find that not all of them respect robots.txt
[04:28:55] Ah, I thought you meant to block those users at the MySQL level
[04:29:08] Oh no
[04:29:34] But indirectly they are affecting the performance and stability of the DB servers, is my belief
[04:30:24] A somewhat out-of-date list of bad bots blocked from XTools, see step #12: https://wikitech.wikimedia.org/w/index.php?title=Tool:XTools#Building_a_new_instance
[04:32:58] In a normal situation labs do get some load and they normally cope decently well with it, but when we have to depool one of them they do suffer indeed
[04:33:19] I think this is a conversation to have with bd808, as I am not really sure if blocking crawlers is something we have ever considered
[04:33:29] Maybe we can block them in certain situations (like this)
[04:35:35] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[04:36:04] the truly nasty crawlers use legit UAs from common web browsers
[04:36:19] Yeah, I was just about to create a phab task
[04:36:32] Yes, I see a lot of those
[04:37:17] the robots.txt for Toolforge is not very complete, for sure -- https://tools.wmflabs.org/robots.txt
[04:37:23] They're a pain for sure, but the legit crawlers usually have something I can block, and it's just a big game of whack-a-mole
[04:38:27] Yeah, well, the bots I'm blocking do not respect https://xtools.wmflabs.org/robots.txt
[04:40:21] I just wonder what the db usage would look like if we blocked all the crawlers and/or made robots.txt more restrictive
[04:40:30] musikanimal: make a task and we can talk/think about fancier blocking. I worry that, like you said, it's a game of whack-a-mole, and also that we will end up not having any good way to tell what it is achieving
[04:41:13] Will do!
[04:42:49] I guess I could imagine something really "fun" like having a big list of UAs and IPs that we make solve some kind of captcha (or make a constructive edit on a wiki!) before we let them past the nginx reverse proxy
[04:44:10] marostegui: is there an easy way for me (no root on the replica boxes) to see which tools have active db connections at any given time?
[04:44:53] bd808: which user do you use to connect to the DB?
[04:46:17] you'd need access to performance_schema and information_schema to check out the user stats
[04:48:16] marostegui: good question. I guess I generally use a random tool account from the ones I have the creds for when poking manually.
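
[Editor's aside: a minimal sketch of what that kind of per-user inspection could look like on a replica, assuming an account with the PROCESS privilege (or SELECT on performance_schema); since each tool connects with its own MariaDB user, grouping by user roughly maps to grouping by tool.]

    -- Count connections and the longest-running statement per database user.
    -- Without the PROCESS privilege only the account's own threads are visible,
    -- which is what the grant question above is getting at.
    SELECT user,
           COUNT(*)                AS connections,
           SUM(command != 'Sleep') AS running,
           MAX(time)               AS longest_running_seconds
    FROM information_schema.processlist
    GROUP BY user
    ORDER BY running DESC, connections DESC;

[This only shows current connections; historical per-account counters would come from the performance_schema summary tables instead.]
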
[04:49:30] maybe I should get some help setting up an account on the replica boxes that has some rights to poke around but not enough to break things horribly :)
[05:00:02] * bd808 fades into the mist until ~14:00Z
[05:00:36] bd808: have a good night
[05:17:56] I created https://phabricator.wikimedia.org/T226688
[05:18:21] thank you
[05:27:00] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:06:14] what is the status of labsdb compression?
[07:07:23] I have stopped it on 1011
[07:07:27] And started replication again
[07:07:34] Will finish it next week
[07:16:25] the amount of writes is saturating labsdb1011's transaction log
[07:16:30] that is a first for us
[07:17:10] Maybe I can start replication in batches
[07:17:16] Not all at once
[07:17:17] no, it is ok
[07:17:27] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&from=1561609042545&to=1561619842545
[07:17:35] it is just an unusual state
[07:18:02] yeah, but expected
[07:18:03] having ~4GB of queued writes
[07:18:52] I think IO is being saturated
[07:19:43] see also that the amount of writes on s5 is low; we should think of moving other wikis to it at some point
[07:20:01] I was thinking about doing it on the next DC swap
[07:20:14] Maybe s5 and s6 even
[07:22:46] Shall I create the task to start discussing it?
[07:22:49] Or possible candidates
[07:23:40] ok to me
[07:23:50] I was checking the backup stats already
[07:29:56] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:31:29] None reaches the 5GB compressed: https://phabricator.wikimedia.org/P8660
[07:32:42] nice
[07:32:51] so good candidates to be moved to either s5 or s6
[07:32:57] I will paste that on the task
[07:36:07] https://phabricator.wikimedia.org/P8661
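
[Editor's aside: the figures in the pastes above come from the backup metadata, but a rough live equivalent can be pulled from information_schema on a replica. A minimal sketch; the LIKE filter is only illustrative, the 5 GB cutoff is the one mentioned above, and on-disk sizes will differ from compressed-backup sizes.]

    -- Approximate per-wiki (per-schema) size to spot small move candidates.
    SELECT table_schema AS wiki,
           ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
    FROM information_schema.tables
    WHERE table_schema LIKE '%wiki'
    GROUP BY table_schema
    HAVING size_gb < 5
    ORDER BY size_gb DESC;
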
[07:37:45] what is backup_id = 1901 and 1897?
[07:39:53] the last s3 snapshots on eqiad and codfw
[07:40:03] it is the autoinc on the backups table
[07:40:04] ah, good
[07:40:20] I have one more stat coming
[07:45:11] these are unreliable: https://phabricator.wikimedia.org/P8662
[07:45:35] because there are a lot of ignored events due to the amount of different objects
[07:47:10] mediawikiwiki is clearly one of the candidates, along with loginwiki
[07:47:47] yeah, but I am not sure about both, given they are group 0
[07:48:06] but at least we have the stats
[07:48:42] yeah, we can discuss, we have plenty of time
[07:49:49] it is funny because you can see s5 has almost recovered while the others still have 4-5 days of delay
[07:50:00] haha yeah
[07:50:00] look
[07:50:03] Seconds_Behind_Master: 469802
[07:50:03] Seconds_Behind_Master: 425358
[07:50:04] Seconds_Behind_Master: 351056
[07:50:04] Seconds_Behind_Master: 431469
[07:50:04] Seconds_Behind_Master: 123231
[07:50:04] Seconds_Behind_Master: 404877
[07:50:04] Seconds_Behind_Master: 394730
[07:50:04] Seconds_Behind_Master: 432923
[07:51:16] parsercache hit rate is still lowish
[07:52:31] yeah, it normally takes like 5 days or so, as we saw in previous changes
[07:59:07] es2 and es3 are at 70% disk utilization
[08:01:21] but growth seems to have slowed down since 5/28
[08:02:19] reaching 80% in December 2019
[08:32:47] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[08:56:47] did you have time to give a quick look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/519203 ?
[09:07:41] I will try to do so today
[09:58:59] oh, today is Thursday?
[09:59:07] it is \o/
[09:59:14] I was convinced it was Wednesday
[09:59:20] surprise!
[09:59:25] I may arrive late to the meeting
[09:59:25] off by one :D
[10:08:21] oh crap
[11:15:25] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo)
[11:15:44] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo) p:Triage→Normal
[11:17:52] DBA, Epic: Setup es4 and es5 replica sets for new read-write external store service - https://phabricator.wikimedia.org/T226704 (jcrespo) @Anomie You mentioned some potential ES maintenance in the past. A server change would be a great opportunity to transition to a different mw configuration once we are o...
[12:40:23] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) @Marostegui, which wikis are affected? Only English Wikipedia? Do you need to display a banner too?
[12:41:27] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) >>! In T226358#5288891, @Trizek-WMF wrote: > @Marostegui, which wikis are affected? Only English Wikipedia? > Do you ne...
[12:42:14] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) Thank you! :)
[13:30:47] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Trizek-WMF) Banner set. It will be displayed starting at 05:00 UTC on July 3 on all wikis, ending at 06:20 UTC.
[13:31:30] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) Thank you!
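
[Editor's aside: for context on the Seconds_Behind_Master readings above and the catch-up status below — the labsdb hosts are multi-source replicas with one named connection per section, so per-section lag can be read in a single statement. A minimal sketch; the 's5' connection name is an assumption based on the section naming used in this conversation.]

    -- One row per replication connection on a multi-source MariaDB replica;
    -- relevant fields: Connection_name, Seconds_Behind_Master,
    -- Slave_IO_Running, Slave_SQL_Running.
    SHOW ALL SLAVES STATUS\G
    -- A single connection can also be queried directly:
    SHOW SLAVE 's5' STATUS\G
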
[13:59:30] everything has caught up on labsdb1011 except the big ones: s8 is 35k behind, commons 100k and enwiki 260k
[13:59:40] they have caught up half way in around 6 hours
[13:59:49] so that is looking good
[14:05:15] db2102 is running stretch with a bpo kernel? is that intentional or was that pulled in indirectly/unintentionally?
[14:39:06] I think db2102 was recently installed; unless manuel says something special about it, I suppose it was installed with the typical method
[14:42:46] sooo... I've a question for you for dbctl. When committing/restoring the configuration that mediawiki will see, what kind of SAL message do you expect/have in mind?
[14:42:56] jynus: actually it was me who installed the 4.19 kernel... we were debugging some server issues (with a stalled reboot or similar) in early April
[15:58:56] regarding SAL, I don't care, whatever happens now for the other new instances
[16:01:44] it's quite different given the amount of data that changes in the db configuration compared to a host action.
[16:02:16] honestly, I don't have enough information to say
[16:02:21] anyway we have a proposal (cdanis' merit if you like it ;) ) and we can improve it later
[16:02:34] if it is too verbose, we can send some stuff to #dba only
[16:03:01] there is one thing I just remembered we may need
[16:03:13] but it doesn't have to be in a first version
[16:05:29] actually, now that I see it, it can stay on codeconfig, so ignore me
[16:24:26] DBA, Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (Cmjohnson) Open→Resolved @Marostegui disk swapped but this server is out of warranty. I would suggest moving masters to new servers.
[16:36:56] DBA, Operations, ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (jcrespo) That's the plan. See: ` root@db1072:~$ megacli -PDList -aALL | grep rro Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 3 Media Error Count: 0 Other Error Coun...
[20:11:14] marostegui: jynus around?
[21:24:32] DBA, Core Platform Team, MediaWiki-Database, MediaWiki-General-or-Unknown: Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (Krinkle)
[21:25:36] DBA, Core Platform Team, MediaWiki-General-or-Unknown: Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (Krinkle) //(not currently an issue with the RDBMS library or a schema. Mass-triaging, feel free to revert if it seems I got it wrong.)//
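
[Editor's aside: on the query-planning task above, a hedged sketch of the kind of plan inspection it implies; the query below is a made-up illustration against the MediaWiki revision table, not one taken from the task.]

    -- Compare the optimizer's chosen plan on hosts running different versions:
    EXPLAIN
    SELECT rev_id
    FROM revision
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 50;
    -- Newer MariaDB releases also offer EXPLAIN FORMAT=JSON and ANALYZE SELECT,
    -- which executes the query and reports actual vs. estimated row counts.
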