[05:17:08] 10DBA, 10Epic, 10Patch-For-Review, 10codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#4128987 (10Marostegui)
[05:17:12] 10DBA, 10MediaWiki-API: Database query error (internal_api_error_DBQueryError) while getting list=allrevisions - https://phabricator.wikimedia.org/T123557#4128988 (10Marostegui)
[05:17:19] 10DBA, 10Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#4128985 (10Marostegui) 05Open>03Resolved db1066 is now fixed ``` root@neodymium:~# mysql -hdb1066.eqiad.wmnet enwiki -e "show create table revision\G" *************...
[06:43:39] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129039 (10Marostegui) Some more food for thought. The errors happen _exactly_ every 10 minutes, almost to the second. Bursts after depooling it from main (according to logstash): 06:20:10 until 06:20:13 06:30:11 un...
[07:27:54] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129117 (10Marostegui) So during the errors, normally around 5 seconds or so, there is a burst in connections, which almost doubles the normal amount of connections. Examples Time and amount of hits on tcpdump to po...
[07:38:09] did you get the full trace of those^?
[07:54:45] yes
[07:54:49] I am going through them
[07:56:18] it is amazing how it is 10,20,30… and the second between 10 and 13
[07:57:47] yes, I said it was bursty
[07:57:54] but even to the second?
[07:58:06] to the second probably means internal
[07:58:08] it cannot be normal traffic
[07:58:11] exactly
[07:58:16] cronjob for wikidata?
[07:58:28] so far I am only seeing mw hosts
[07:58:32] connecting to it
[07:58:54] yes, but we need to know the queries
[07:59:01] http and sql
[07:59:12] although if they are failing to connect, no query yet
[07:59:18] exactly
[07:59:22] there are no queries during those seconds
[07:59:26] just connections
[08:03:29] So during the seconds it lasts I see connections from only mw hosts, db1052 and db1115 (tendril)
[08:03:38] nothing like terbium or stuff like that
[08:04:15] that is strange, I assume tendril is not a large contributor to it
[08:04:30] otherwise there could be serious issues there with connections
[08:08:17] I am trying to think what we could disable to try to isolate traffic or something
[08:12:20] maybe we can involve arzhel to see if he can see something at switch level or something
[08:13:57] I would put the server offline and do some tests
[08:14:05] offline == depool
[08:14:28] but if we depool it from API only, errors are gone
[08:14:58] that is the weird thing, that whatever it is, it is not moving to another API server
[08:16:21] that is why I want to try to break it or do some tests
[08:16:54] maybe it gets overloaded and the query killer kills stuff, etc.
[08:19:11] let's depool it from everywhere, and see if there is anything arriving to it at those times XX:20:10, XX:30:10 etc
[08:19:23] just in case something is hardcoded or internal or something
[08:20:55] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129203 (10Marostegui) From the captures, I only see "normal" traffic as in: mw hosts, db1052 (the master) and db1115 (tendril).
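For the per-second counting of tcpdump hits mentioned in the 07:27:54 comment, something along these lines works. It is only a sketch: the capture command it expects (`tcpdump -tt -nn 'tcp dst port 3306'`) and the port are assumptions, the log only says that hits were counted per second.

```
#!/usr/bin/env python3
"""Bucket tcpdump hits per second to make connection bursts visible.

Sketch only: assumes `tcpdump -tt -nn 'tcp dst port 3306'` output piped
on stdin (epoch timestamp as the first field of each line); the port is
an assumption, not taken from the log.
"""
import sys
from collections import Counter

hits_per_second = Counter()
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    try:
        # `tcpdump -tt` prints the epoch timestamp as the first field
        second = int(float(fields[0]))
    except ValueError:
        continue  # skip continuation/hexdump lines
    hits_per_second[second] += 1

# Show the busiest seconds so bursts like XX:20:10-XX:20:13 stand out.
for second, count in hits_per_second.most_common(20):
    print(second, count)
```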
[08:36:43] I am going to run some tests, tendril on db1114 will fail
[08:36:50] for a few minutes
[09:14:58] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129264 (10Marostegui) We can now discard tendril for sure as a cause of this (it was hard to believe it was it anyways, but better to confirm it). I used iptables to drop all the traffic coming from tendril DB (db1...
[10:11:35] I am going to take a break for an early lunch while I wait for the es1013 buffer pool to warm up
[10:12:03] enjoy!
[10:19:36] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Wikidata-Ministry-Of-Magic-Tech-Debt, and 3 others: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279#4129310 (10WMDE-leszek)
[10:21:15] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review, and 2 others: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279#4002167 (10WMDE-leszek)
[10:40:29] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129334 (10Marostegui) >>! In T191996#4126282, @Marostegui wrote: > This host has dropped around 300 packets in 15h or so. > Yesterday I checked the amount of drops in its interface and it was 1815, today it is 2103...
[10:56:52] I cannot pool es1013 yet, its cache is still at -243% efficiency
[10:57:15] sorry, that was a mistake
[10:57:21] I meant -248% efficiency
[10:58:04] I think that means that for every request made, 3-4 are found not in cache
[11:15:15] jynus: during my lunch break I went ahead and added tox and fixed up stuff for wmfmariadbpy : https://gerrit.wikimedia.org/r/#/c/426004/ :]
[11:20:04] thanks for that!
[11:21:02] that is extremely helpful
[11:23:06] check_health and compare were not really production ready, but you already fixed them, too!
[11:24:38] hashar: do you have any direction for how to do test that require heavy backend (sysop) setup?
[11:24:46] *tests
[11:25:14] e.g. set up a mariadb server with a fake copy of enwiki data
[11:26:07] * fully automatic tests (obviously, setup can be done manually)
[12:14:10] jynus: I have no clue what the software is doing. But potentially the integration tests could take care of spinning up a mariadb instance and populating it with test data
[12:14:48] to run against a real backend with real data, I guess that can be done manually
[12:15:02] (sorry I went out to bring kids to school and walk a bit)
[12:15:32] I am not sure that is really feasible - we are talking 1-2 hours to set up the database
[12:15:59] could that be a static resource à la beta
[12:16:23] so only the frontend is set up each time?
[12:17:02] I am basically a bit lost, and if you have any idea, not something high priority right now
[12:18:44] (you don't have to know, just in case you said "oh, we do this like that for the similar thing Y")
[12:31:12] jynus: surely we could have CI point to a database hosted on labs, a bit like the toollabs replica maybe
[12:31:31] or just manually run it against an existing db
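On the es1013 warm-up above (10:56 to 10:58): the exact formula behind the quoted -248% "efficiency" is not in the log, so the sketch below sticks to the standard InnoDB counters and prints a plain buffer pool hit rate (1 - disk reads / logical read requests) as a way to watch the pool warm up before repooling. The host name is taken from the conversation; everything else is an assumption.

```
#!/usr/bin/env python3
"""Rough buffer pool warm-up check before repooling a host.

Sketch only: uses the mysql client like the commands quoted earlier in
the log, and the standard SHOW GLOBAL STATUS counters; the "efficiency"
metric from the graphs is not reproduced here because its formula is
unknown.
"""
import subprocess

HOST = "es1013.eqiad.wmnet"  # the host being warmed up in the conversation


def global_status(host, like):
    """Return SHOW GLOBAL STATUS rows matching `like` as a dict."""
    out = subprocess.check_output(
        ["mysql", "-h", host, "-BN", "-e",
         "SHOW GLOBAL STATUS LIKE '{}'".format(like)],
        universal_newlines=True)
    return {name: int(value)
            for name, value in (line.split("\t") for line in out.splitlines())}


status = global_status(HOST, "Innodb_buffer_pool_read%")
disk_reads = status["Innodb_buffer_pool_reads"]        # reads that had to hit disk
requests = status["Innodb_buffer_pool_read_requests"]  # all logical read requests
hit_rate = 1 - disk_reads / requests if requests else 0.0
print("buffer pool hit rate: {:.1%}".format(hit_rate))
```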
[12:31:55] for CI, possibly one could use a generated set of data. Maybe there is no need to have millions of rows in a test database
[12:32:26] yes, unit/validation tests are ok for CI
[12:32:44] installing mysql and pregenerating <1MB of data is very fast
[12:33:08] which would cover a few functionalities already
[12:33:16] but certain tools there will have to take care of complex topology changes
[12:33:29] and data provisioning
[12:33:31] I should be able to craft a container that has mariadb.deb shipped, then the suite could spawn a mariadb server
[12:33:53] but for the heavy testing, most probably you want to run it manually against a controlled environment
[12:34:34] after all, these are sysop tools, not web tools, which means they have higher dependencies
[12:34:42] not higher
[12:34:44] deeper
[12:36:01] ideally we would even have a job that does not pip install at all and uses .deb packages instead :D
[12:36:33] oh, I agree with that
[12:36:37] that can be done "easily" by using a debian package
[12:36:46] and have the debian building toolchain run the tests in a chroot
[12:37:07] which would have no network access and all python modules installed based on the Depends: field in debian/control
[12:37:24] so e.g. one can do the development with random pip installed dependencies
[12:37:39] and when the packaging work happens, the tests get to run in a more controlled environment that matches what is on prod
[12:37:44] (well, more or less)
[12:37:50] yes
[12:38:42] I guess there could be 3 levels, CI, staging (production size) and production?
[12:39:14] only full releases go to staging
[12:39:31] and only staging-approved ones go to production
[12:39:40] (manually)
[12:41:28] in the SSD pipeline project (which basically overhauls everything) that is more or less the idea
[12:41:52] the software will run in more or less the same environment either locally, on CI, in staging and in production
[12:42:34] I guess once a version is polished and works fine in CI, a release candidate is cut
[12:42:46] SSD for us is disks, what does that really mean for you?
[12:42:47] then the release candidate is tested on staging with a real dataset
[12:42:56] and once polished up, a final can be cut and moved to prod
[12:43:08] Streamlined Software Deployment iirc
[12:43:16] oh, that is another question, but more general
[12:43:18] or maybe the project is now named "deployment pipeline"
[12:43:24] should staging have real data or not?
[12:43:47] but the idea is that eventually one sends a patch to gerrit, votes +2 and stuff gets tested and ultimately deployed to prod automatically
[12:43:48] it wasn't clear in my department
[12:44:04] for staging I don't know. That depends on what we want to test
[12:44:05] if testing production-sized databases
[12:44:21] should be on a breakable replica or one with fake data
[12:44:22] most probably it is easier to just have a snapshot that gets refreshed from time to time
[12:44:26] with private data stripped
[12:44:51] or yes, maybe fake data is good enough. Then it might be challenging to generate good fake data
[12:44:58] well, that is the most difficult of the 3
[12:45:02] while a curated snapshot is probably easier to set up
[12:45:14] (real data with everything private stripped)
[12:46:01] we should talk more about aims for that in the future
[12:46:22] I guess we can use the ops mailing list for that
[12:46:38] (and people not interested can just mute the thread)
[12:48:23] wmfmariadbpy now has tox running in CI: https://integration.wikimedia.org/ci/job/tox-docker/1622/ :]
[12:49:55] thank you again!
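Going back to the 12:32 point that installing mysql and pregenerating <1MB of data is very fast: a minimal sketch of what such a pregenerated fixture could look like. The database and table below are made up for illustration (they are not the real enwiki schema); a real fixture would mirror whichever slice of the production schema the tools under test actually touch.

```
#!/usr/bin/env python3
"""Emit a tiny, reproducible SQL fixture (well under 1MB) for CI-level tests.

Sketch only: `testwiki` and `revision_sample` are hypothetical names,
chosen just to show the sizes involved, not the real production schema.
"""
import random

random.seed(42)  # same fixture on every run

statements = [
    "CREATE DATABASE IF NOT EXISTS testwiki;",
    "USE testwiki;",
    "CREATE TABLE revision_sample ("
    " rev_id INT UNSIGNED PRIMARY KEY,"
    " rev_page INT UNSIGNED NOT NULL,"
    " rev_timestamp BINARY(14) NOT NULL);",
]
# A couple of thousand rows is plenty for unit/validation level checks.
for rev_id in range(1, 2001):
    statements.append(
        "INSERT INTO revision_sample VALUES ({}, {}, '20180401{:06d}');".format(
            rev_id, random.randint(1, 50), rev_id))
sql = "\n".join(statements) + "\n"
assert len(sql.encode()) < 1024 * 1024  # stays far below the 1MB mentioned above

with open("fixture.sql", "w") as fixture:
    fixture.write(sql)
print("wrote fixture.sql ({} bytes)".format(len(sql)))
```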
[12:50:06] jynus: and if you ever want to spawn a mariadb database with the current user and a file socket to write to: https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py#L138-L159
[12:50:32] (which really should only take 6 seconds for you to figure out, but took me a good chunk of time to get right)
[12:50:38] oh, I didn't know that existed
[12:50:58] which db does it create, etc.?
[12:51:01] that is to spawn a mariadb instance in the background and then have mediawiki installed using it
[12:51:09] (version?)
[12:51:14] so we can have multiple mariadb instances spawned when multiple jobs run on the same instance
[12:51:36] then I create the database, GRANT stuff https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py#L121-L136
[12:51:57] and finally can run the mediawiki installer against that using something like php maintenance/install.php --dbpath=/tmp/mariadb.socket
[12:51:59] \o/
[12:52:28] so then I can run multiple mediawiki tests in parallel on the same host, and each has its own little database!
[12:53:03] this may actually also be helpful to that project in the future
[12:53:14] so for wmfmariadbpy integration tests, potentially the integration tests could try to set up a mariadb instance for themselves
[12:53:26] it could even be a fresh one for each test (to be set up in setUp() and killed in tearDown())
[12:53:27] the idea is to abstract common db tasks
[12:53:54] and then get an option to point to an existing database using an environment variable like: DATABASE=foo
[12:54:13] if DATABASE is set, setUp() would not spawn a fresh instance but use whatever has been indicated via the env variable
[12:54:28] so on CI, with no database, the integration test suite would spawn the db and run tests against it
[12:54:45] locally you would be able to run the integration tests against an existing instance, simply by setting an env variable
[12:55:02] that is all in theory obviously
[12:55:26] feel free to reuse the code at https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py
[12:55:42] though there is a bit of boilerplate to make it act as a context manager. So I can do something like:
[12:55:46] with MySQL():
[12:55:49]     php install.php
[12:55:52]     php phpunit.php
[12:55:57] print("done")
[12:56:19] (and mysql gets killed when the context is exited, probably before "done" gets printed)
[13:55:35] es hosts seem quite happy with 10.1, so I may reimage codfw masters next week
[14:27:30] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129748 (10Marostegui) After changing the port configuration and for the record, this is what the interface is showing ``` RX errors 0 dropped 2487 overruns 0 frame 0 ``` We'll see what happens once the server...
[14:28:58] 10DBA, 10Operations, 10netops: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129750 (10jcrespo) Adding the tag to reflect work done at the network layer.
[15:08:05] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129814 (10ayounsi) ```name=db1114 ethtool eno1 Supported pause frame use: No Advertised pause frame use: Symmetric Link partner advertised pause frame use: No ``` ```name=db1114's switch...
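For reference on the pattern hashar describes above (12:50 to 12:56): a rough sketch of a throwaway-instance context manager plus the DATABASE environment variable idea. It is not copied from the quibble backend linked above, and it assumes a MariaDB installation that ships `mysql_install_db` and `mysqld` on the PATH; the class and variable names are made up.

```
#!/usr/bin/env python3
"""Throwaway MariaDB instance as a context manager, plus the DATABASE override.

Sketch only, modelled loosely on the quibble backend linked above (not
copied from it). Assumes a MariaDB package providing `mysql_install_db`
and `mysqld` on PATH; uses a unix socket so parallel runs do not fight
over TCP ports.
"""
import os
import shutil
import subprocess
import tempfile
import time


class ThrowawayMariaDB(object):
    """Spawn mysqld on a private datadir/socket, kill it when the context exits."""

    def __enter__(self):
        self.datadir = tempfile.mkdtemp(prefix="mariadb-test-")
        self.socket = os.path.join(self.datadir, "mysqld.sock")
        subprocess.check_call(
            ["mysql_install_db", "--datadir=" + self.datadir],
            stdout=subprocess.DEVNULL)
        self.proc = subprocess.Popen(
            ["mysqld", "--datadir=" + self.datadir, "--socket=" + self.socket,
             "--skip-networking", "--skip-grant-tables"])
        # Crude readiness wait: a robust version would also check that the
        # process is still alive and give up after a timeout.
        for _ in range(120):
            if os.path.exists(self.socket):
                break
            time.sleep(0.5)
        return self

    def __exit__(self, *exc):
        self.proc.terminate()
        self.proc.wait()
        shutil.rmtree(self.datadir)
        return False


if __name__ == "__main__":
    # The env-var trick from the conversation: reuse an existing instance
    # when DATABASE is set, otherwise spawn a fresh one for this run.
    existing = os.environ.get("DATABASE")
    if existing:
        print("running tests against existing instance:", existing)
    else:
        with ThrowawayMariaDB() as db:
            print("running tests against temporary socket:", db.socket)
```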
[15:09:22] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129816 (10Marostegui) @ayounsi thanks for your help. If you want to compare it with the other two servers that receive exactly the same traffic, those are: db1066 and db1080.
[16:26:42] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130018 (10ayounsi) 1/ Flow-control not helping, reverted 2/ Are the other servers seeing the same bursts of inbound sessions? 3/ The `ifconfig` input drop counter matches the nic stats...
[16:27:07] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130019 (10Marostegui) So given that db1066 and db1080 have the same traffic as db1114 (and even more when db1114 gets depooled from API) and they don't suffer any kind of issues, could...
[16:34:14] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130039 (10Marostegui) >>! In T191996#4130018, @ayounsi wrote: > 1/ Flow-control not helping, reverted > Cool > 2/ Are the other servers seeing the same bursts of inbound sessions? Th...
[16:43:04] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130046 (10Marostegui)
[22:10:30] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4130987 (10Bstorm) So should `content.content_address` be NULL in replica views? Just trying to clarify how the comments her...
[23:30:03] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4131129 (10daniel) content.content_address doesn't have to be nulled. There will just be no mechanism on labs for resolving t...
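Lastly, on the dropped-packet counters being tracked in T191996 (1815, then 2103, then 2487): a small sketch for watching the counter in near real time, to line drops up against the XX:X0:10 connection bursts. It assumes a Linux sysfs layout; eno1 is the interface named in the ethtool output, and the polling interval is arbitrary.

```
#!/usr/bin/env python3
"""Watch a NIC's rx_dropped counter to correlate drops with the connection bursts.

Sketch only: assumes the standard Linux sysfs statistics path and the
eno1 interface named in the ethtool output above.
"""
import time

STAT = "/sys/class/net/eno1/statistics/rx_dropped"


def read_drops():
    with open(STAT) as stat:
        return int(stat.read().strip())


previous = read_drops()
print(time.strftime("%H:%M:%S"), "starting at", previous, "drops")
while True:
    time.sleep(5)
    current = read_drops()
    if current != previous:
        print(time.strftime("%H:%M:%S"), "rx_dropped", previous, "->", current)
        previous = current
```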