[06:27:41] 10DBA, 06Labs, 10Labs-Infrastructure, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687248 (10Marostegui) Makes sense - I will closed it as resolved and will report back if Percona/MariaDB report some findings.
[06:27:53] 10DBA, 06Labs, 10Labs-Infrastructure, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687249 (10Marostegui) 05Open>03Resolved
[06:36:02] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2687265 (10Marostegui) a:03Marostegui
[06:44:23] 10DBA, 06Labs, 10Labs-Infrastructure, 10MassMessage, and 2 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687274 (10Legoktm) The queries listed in the bug so far all seem to occur for the MassMessage system user "MediaWiki message delivery" (by lookin...
[06:48:51] marostegui: ^
[06:49:16] 10DBA, 06Labs, 10Labs-Infrastructure, 10MassMessage, and 2 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687278 (10Marostegui) Thanks a lot @Legoktm for spending time on this. Hopefully by changing the code as you just did and converting them to Inn...
[06:49:24] legoktm: good timing, just commented :)
[06:49:58] legoktm: The thing is…we still don't know how can that cause the server to stall and if there will be more cases :_(
[06:50:05] (with other queries)
[06:50:05] heh :)
[06:50:20] yeah...that's totally out of my knowledge
[06:50:28] And ours :-)
[06:50:41] And looks like out from MariaDB and Percona too, as they are kinda blaming each other now
[06:51:19] hah
[07:21:37] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687311 (10Marostegui) This table has been dropped for good in both DCs.
[07:21:53] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687314 (10Marostegui)
[07:53:38] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687359 (10Marostegui) TO_DROP_hitcounter table has been deleted from both DCs, I believe this ticket can be closed now.
[07:59:31] jynus: Were you able to talk to MariaDB/Monty in the end yesterday?
[07:59:35] Just curious
[07:59:59] I briefly spoke at someone at the mariadb foundation
[08:03:17] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687376 (10jcrespo) ``` MariaDB MARIADB db1083 enwiki > SHOW TABLES like '\_counter%';; +-------------------------------+ | Tables_in_enwiki (\_counter%) | +-----------------------------...
[08:05:05] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687379 (10Marostegui) Shall I drop it also from S1 then? The original post said S3 wikis, but I can research if it can be dropped from everywhere.
[08:08:36] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687380 (10jcrespo) I wrote that, and I meant that I had to recreate them on s3 because I restarted one of those, creating issues there. If a table is dropped from production, we should...
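Editor's note: the truncated paste at 08:03:17 above is the check for leftover counter tables on enwiki (db1083). A minimal sketch of that kind of check follows; the information_schema variant is an illustrative addition and not taken from the log.

```sql
-- Check one wiki database for leftover counter tables before dropping them.
-- The backslash makes LIKE match a literal underscore instead of treating
-- '_' as a single-character wildcard.
USE enwiki;
SHOW TABLES LIKE '\_counter%';
SHOW TABLES LIKE 'hitcounter';

-- The same check across every schema on the host, via information_schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE '\_counter%'
   OR table_name = 'hitcounter';
```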
[08:11:39] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687393 (10Marostegui) Makes sense - I will review that, rename and if doesn't cause issues drop them.
[08:12:22] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687394 (10Marostegui) This table exists in S1 - check and drop if possible.
[08:12:34] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687399 (10Marostegui)
[08:12:36] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687398 (10Marostegui) 05Resolved>03Open
[08:36:27] hi! not sure if you've seen it but db1081 / db1084 still saturate their ethernet ports, expected?
[08:38:47] godog: I can check, although I am probably missing context that jynus might have :-)
[08:40:50] godog: so it spikes from time to time reaching 1Gbps as I see in the graphs?
[08:41:23] marostegui: yeah, I noticed it from the emails I'm getting from librenms, we alert on high port utilization
[08:41:32] most times it is nothing and/or expected though
[08:44:39] Those times were there are network spikes matches when there are spikes of DELETEs and UPDATES
[08:44:53] Maybe jobs running at that time?
[08:46:17] marostegui: I think the cause is another, it's outbound traffic, not inbound
[08:48:18] I can see slow queries too at those times
[08:48:28] (I'm looking at db1081
[08:51:29] In db1084 pt-kill killed some queries at around that time
[08:52:18] ok, so from what I'm seeing I see a lot of SELECT img_metadata FROM `image`
[08:52:21] from ForeignDBFile::loadExtraFromDB
[08:52:30] the img_metadata field is a blob
[08:52:42] :(
[08:52:44] it might justify the high output size
[08:54:37] all in commons ofc
[08:55:26] ah, mhh it might be related to WLM as it just finished I think
[08:55:37] WLM?
[08:57:46] the strange thing is that those queries are querying by primary key (img_name)... so shouldn't be slow :(
[08:58:31] marostegui: wiki loves monuments, runs in september
[09:00:44] Ah, thanks :)
[09:01:43] np, yearly reoccurence where there's a spike of uploads to swift :))
[09:01:47] so my question is what is supposed to be put in the img_metadata?
[09:02:11] from one random from the slow queries length(img_metadata) is 11MB
[09:04:28] mmmh I might have found it...
[09:04:32] they are all .pdf
[09:05:05] actually it might be the same all the time
[09:05:50] mostly the same, but there are others too
[09:05:59] so I guess the PDF is stored in the metadata? :(
[09:17:03] so we have the explanation of the traffic peaks
[09:18:33] now we need to know if it's normal that kind of content and why was queried so many times in a short timeframe
[09:21:39] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687625 (10Marostegui) So `_counters`still exist at: S1 enwiki S2 bgwiki bgwiktionary cswiki enwikiquote enwiktionary eowiki fiwiki idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki thw...
[09:22:51] I found this https://phabricator.wikimedia.org/T96360
[09:22:51] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687629 (10Marostegui) So `_counters` and `hitcounter` still exist at: S1 enwiki S2 bgwiki bgwiktionary cswiki enwikiquote enwiktionary eowiki fiwiki idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki thwiki trwiki zh...
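Editor's note: to put numbers on the blob sizes being discussed, a check along the following lines would do it. This is a sketch only: the placeholder filename and the mime filter are illustrative, and the second query scans every PDF row on commonswiki, so it belongs on a depooled or low-traffic replica rather than a pooled one.

```sql
-- On commonswiki: measure the metadata blob of one file seen in the slow
-- query log (placeholder name), then list the largest PDF metadata blobs.
-- The production lookups themselves are fast (img_name is the primary key);
-- the problem is the volume of bytes shipped back for rows like these.
SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_name = 'Example_document.pdf';        -- placeholder filename

SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_major_mime = 'application'
  AND img_minor_mime = 'pdf'
ORDER BY metadata_bytes DESC
LIMIT 20;
```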
[09:23:01] T96360
[09:23:03] T96360: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360
[09:23:07] godog , marostegui ^^^
[09:23:25] wow,. that is old
[09:24:17] I can confirm many are from bots
[09:26:46] * volans brb
[09:28:26] interesting, the task above seems to cover djvu but not pdf
[09:36:30] I don't see a PDF handler here https://github.com/wikimedia/mediawiki/tree/master/includes/media
[09:40:48] volans, is that happening again?
[09:41:55] jynus: it happened twice tonight and another time yesterday as I can tell from ganglia/tendril
[09:42:49] that was probable me doing a schema change?
[09:43:00] on depooled slaves
[09:43:30] jynus: not, you lost some history I can see you disconnected and reconnected
[09:43:46] sorry
[09:43:51] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687676 (10Marostegui) Using a slave with not much traffic of each shard I have renamed both tables to make sure there are no replication issues. If in 48h all looks fine, I will go ahea...
[09:43:57] so basically most of them are querying img_metadata from image on commonswiki on a PDF document that has a ~13MB metadata field
[09:44:19] mostly from bots, there were like 12k select on those, that justify the network saturation in outbound
[09:44:36] I've found that is the same behaviour of T96360
[09:44:37] T96360: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360
[09:44:46] just for PDF instead of Djvu
[09:45:18] is it related to refreshlinks, or regular queries?
[09:45:31] sorry, I with regular I mean webrequest from users
[09:45:31] ForeignDBFile::loadExtraFromDB
[09:45:38] mostly web bots
[09:45:46] google, msn...
[09:45:48] ok, thanks
[09:45:52] but not only
[09:46:05] maybe we should file a new ticket and reference that one
[09:46:09] see tendril slow queries for the last 10h on db1081 for example
[09:46:13] I think so, I can do it
[09:46:42] note that even if it wasn't the schema change
[09:46:49] we were on reduced load
[09:46:58] 2 out of 3 servers for most of yesterday
[09:47:06] that could contribute to it
[09:47:25] sure, although 12k select of 13MB is ~156GB ;)
[09:47:26] we can also pool more servers as a temporary measure
[09:47:42] well, I am not sayin that as a fix
[09:47:53] but as a measure to not saturate the link until a fix is done
[09:47:57] ofc it helps to spread on more 1Gb cables ;)
[09:48:03] yep
[09:48:04] yeah
[09:48:14] let me open the task
[09:48:15] how did you got that?
[09:48:26] were you monitoring ethernet stats?
[09:48:30] godog see the librenms alarm on port utilization
[09:48:33] *saw
[09:48:36] and asked here
[09:48:37] thanks
[09:49:28] maybe you can add a mail rule to put in your inbox the db ones?
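Editor's note: the rename-before-drop step described in the 09:43:51 comment follows the TO_DROP_ naming convention visible elsewhere in this log (TO_DROP_hitcounter, TO_DROP_email_capture). A minimal sketch of the pattern; the renamed name for `_counters` and the exact table set are illustrative.

```sql
-- Step 1: rename instead of dropping, so anything still reading or writing
-- the table fails visibly, and the data can be restored with another RENAME.
RENAME TABLE hitcounter TO TO_DROP_hitcounter;
RENAME TABLE `_counters` TO `TO_DROP__counters`;   -- renamed name is illustrative

-- Step 2: after the observation window (48h in the comment above), if the
-- application and replication stayed healthy, drop for good.
DROP TABLE TO_DROP_hitcounter;
DROP TABLE `TO_DROP__counters`;
```

RENAME TABLE is a fast metadata-only operation, which is why trying it first on a low-traffic replica, as described in the comment, is a cheap way to spot replication or application breakage before touching the rest of the shard.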
[09:51:54] I cannot even find that alert
[09:52:12] godog ^^^ :)
[09:54:26] I can see several alerts, but router ones, not server-specific
[09:54:35] well, not routers, switches
[09:54:56] I guess is that, port utilization checks are on the switches IIRC
[09:55:37] I see it now, the email has as subject "Alert for device asw-a-eqiad.mgmt.eqiad.wmnet - Port utilisation over threshold"
[09:55:49] and the host in the body only
[09:56:27] yup, that
[09:56:40] it doesn't happen often though, mostly for analytics hosts
[10:04:49] I checked stats and it is indeed an issue, but I think not an UBN one
[10:05:54] I'm putting it high
[10:06:01] I will continue monitoring: https://grafana-admin.wikimedia.org/dashboard/db/server-board?panelId=10&fullscreen&from=1475143509645&to=1475575509645&var-server=db1081&var-network=eth0
[10:06:06] just double checking the content is a serialized array
[10:12:58] 10DBA, 10MediaWiki-General-or-Unknown, 06Operations: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (10Volans)
[10:13:36] jynus, marostegui ^^^
[10:13:42] all yours :)
[10:14:56] volans: Thanks, now we can really track it :-)
[10:16:06] jynus: adds additional tags if you think they are needed, the old one was having also availability and multimedia
[10:20:26] well, while it is causing operations issues, I do not think it should be handled at that layer, aside from mitigation
[10:50:30] qq - I installed a new memcached on mc2009 in codfw and I played with its new logging capabilities. This is what I see if I type "watch fetchers":
[10:50:33] ts=1475577987.200351 gid=580 type=item_get key=global%3AChronologyProtector%3Aaede7f15780989d0f428135c3cf94839 status=not_found
[10:51:11] the only ref to "chronology protector" that I found was in https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage :D
[10:51:45] do you guys have more info about it?
[12:15:52] 10DBA, 06Labs, 10Labs-Infrastructure: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688126 (10Dzahn)
[12:16:47] 10DBA, 06Labs, 10Labs-Infrastructure: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688140 (10Dzahn)
[12:17:22] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688126 (10Dzahn)
[12:41:06] 10DBA, 06Operations: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2688222 (10Marostegui) This table has been renamed to `TO_DROP_email_capture` across eqiad in the following wikis. S1: enwiki S3: testwiki
[13:01:02] 10DBA: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2688288 (10Marostegui)
[13:28:50] 10DBA, 06Operations, 06Performance-Team, 07Availability, 07Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2688443 (10BBlack)
[13:37:10] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688485 (10jcrespo) This is not a blocking step at this point- the process can continue but this must be kept open until the production side of filtering is run.
[13:48:42] 10DBA, 06Operations, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2688549 (10Marostegui) I have created the HW request to get it removed: https://phabricator.wikimedia.org/T147309 Also @Volans and myself were unsure about whether it can be safely remove from the array...
[13:52:34] 10DBA: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2688571 (10Marostegui)
[14:08:46] 10DBA, 06Operations, 10ops-eqiad: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2683544 (10Cmjohnson) Replaced disk 0 on db1055...rebuilding Adapter #0 Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 WWN: 5000C50...
[14:09:52] 10DBA, 06Operations, 10ops-eqiad: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2688636 (10Marostegui) Thanks Chris - will check in a few hours and will close this as resolved once it has finished. Manuel
[14:48:07] marostegui, remember to check the physical location if possible when moving shards of a server around
[14:48:15] jynus: Some handover I have collected today so you can get a full picture of the day: https://phabricator.wikimedia.org/P4153
[14:48:27] oh, you are here
[14:48:50] jynus: what do you mean? you are talking about the ticket to replace db1019 with another one for S4 rc service?
[14:49:02] yes
[14:49:33] ok :)
[14:50:53] good news for dbstores, it seems
[14:51:00] yeah :)
[14:51:04] good
[14:51:19] marostegui: FYI our failure "units" are rack, row, datacenter apart the single server ofc
[14:51:38] volans: you lost me
[14:52:10] about the physical location, if you want to spread for redundancy
[14:52:18] aaah gotcha
[14:52:33] :)
[14:53:26] By the way, I should be able to access racktables.wikimedia.org with my ldap account, right?
[14:53:30] Or do I need a separate one?
[14:53:35] nope, separate account
[14:53:38] Ah
[14:53:38] unfortunately
[14:53:45] ping rob ;)
[14:53:49] Will do :)
[14:53:50] Thanks!
[14:53:54] or hackit yourself, as I did
[14:53:59] rotfl
[14:54:02] XDDDDDD
[14:54:06] actually I think anyone can do it
[14:54:12] let me check
[14:54:46] grazzie
[15:07:54] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688832 (10Dzahn) Thank you, i will go ahead with adding it to DNS.
[15:23:39] 10DBA, 06Operations, 10ops-codfw: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2688904 (10Papaul) a:03Marostegui Disk replacement complete
[18:38:05] got a question on two tables in re: to the labsdb stuff
[18:38:16] I'm looking at our candidate for view maintenance and I see it adds to views
[18:38:20] two views even :)
[18:38:26] +abuse_filter_history
[18:38:26] +watchlist_count
[18:38:57] so interestingly these views exist in the old perl version but are not actually exposed on the labsdb _p's I spotchecked
[18:39:11] I'm wondering if both were removed at some point as bad news but the maint script wasn't fixed
[18:39:25] and that would mean whatever two new wiki's were done this year probably have them defined
[18:39:45] but I'm not sure atm what exactly to think, both of those seems like decent candidates for not exposing to my naive eye
[18:45:38] I also don't see abuse_filter_history here https://github.com/wikimedia/operations-software-labsdb-auditor/blob/master/greylisted.yaml#L821
[19:12:17] chasemp: for the abuse filter one... It looks like rows could be exposed based on some flags... Which usually means it's easier to not expose
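Editor's note: for context on why a view like abuse_filter_history is "easier to not expose" — the labsdb replica views are plain SQL views that either expose a production table wholesale or filter out rows and columns based on visibility flags. The sketch below is purely hypothetical: the column names, the flag condition, and the column list are invented for illustration and are not taken from the real maintain-views definitions.

```sql
-- Hypothetical shape of a conditional labsdb view: rows are only exposed
-- when a visibility flag says they are public. Getting such a condition
-- wrong leaks suppressed data, which is why not defining the view at all
-- is the safer default.
CREATE OR REPLACE VIEW enwiki_p.abuse_filter_history AS
SELECT afh_id, afh_filter, afh_timestamp        -- column list is illustrative
FROM enwiki.abuse_filter_history
WHERE afh_deleted = 0;                          -- hypothetical visibility flag
```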
[19:12:30] No idea on watchlist_count
[19:13:05] can't see any sign of it our code
[19:17:03] thanks Reedy, I dug up https://phabricator.wikimedia.org/T78730#947664 which doesn't equal any specific outcome but I'm guessing someone made the same call at ...some point
[19:18:19] Reedy Krenair dug up https://phabricator.wikimedia.org/T123895
[20:52:07] 10DBA, 06Labs, 10Tool-Labs: s51127__dewiki_lists (merlbot) database using 13G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133325#2690378 (10bd808)
[21:24:37] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2690580 (10MarcoAurelio) p:05Triage>03Normal
[21:43:09] 10DBA, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 07FR-2016-17-Q2-Campaign-Support, and 2 others: Spike: Look into transaction isolation level and other tricks for easing db contention - https://phabricator.wikimedia.org/T146821#2690701 (10DStrine)
[22:15:43] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2690863 (10Dereckson) We need `$wgFlowDefaultWikiDb` to be set before to be able to crea...
[22:35:22] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2690926 (10Mattflaschen-WMF) a:03Dereckson