trade.py trade definitely does more work than it should
#182
Replies: 8 comments
-
After consuming around 12GB of memory (my system only has 16GB, so I'm unsure if that 12 is an upper bound, or if it just settled at a "reasonable" 66-67% of total system memory) and 2-3 minutes of wall time, the following command was finally able to output something (granted, said something is still not the most helpful, but that's a separate issue I've brought up before [1]):
(Measured using GNU time.) For comparison, the actual commodities buy command that even got me those points to check the trades of:
A command that at minimum had to be executing queries on distance and multiple queries on commodity prices used 155.7MB of memory and 30 seconds of time, versus a command that outputs two precise locations' data diffed against each other and then sorted, which took 150 seconds and 11.6GB of memory. And yes, if it weren't obvious, I was interested in how

Anyway, something has clearly broken here, and I'm trying to resist the urge to dive into the code at the moment to root-cause it versus actually playing the game and enjoying that CG.

(P.S. [1] refers to my previously opened issue #143, with a subitem asking for

(P.P.S. It would also be pretty nice if the
-
By the way, the reason

Refactoring

I saw that the

P.S. This line in the
-
Don't underestimate my potential for derp -- I totally started TD as a learn-me-some-python project; to emphasize that:
https://python.godbolt.org/z/nn7KrEqv8 and that may be as far as I looked? I also wouldn't have known of/thought of https://python.godbolt.org/z/a9WxcrEPb

This was also probably before I discovered I wasn't going to be able to scale performance with threading :)

There's another issue/discussion elsewhere where I have suggested what was always my original plan: for TD to be its own service so that the data is loaded once. SQLite was just the most accessible ACID-storage option I had at the time; I didn't actually intend to use it as a database, and that's really what you're seeing in this code: me trying to avoid the lure of writing SQL queries, believing that to do so would make it harder to funnel things through TradeDB rather than pure SQLite.

I seem to remember that scoped loading either existed or was planned. I don't know whether I simply screwed it up with order of operations in this one or the sql-creep bent this out of shape.

I've also commented elsewhere that a HUGE problem with the station-items table is that it's a single table instead of separate buying/selling tables, which results in a pair of "x_price > 0" indexes that turn out to have logarithmic complexity to maintain - so when the db was much smaller it was an optimization, but now it's a giant hindering turdburger :( (a sketch of the two shapes is below)

I was working on changes to move those into their own tables, alongside changes to the cache build so that it doesn't rebuild the file from scratch every time, allowing the db to benefit from reuse. https://github.com/kfsone/Trade-Dangerous/tree/kfsone/inplace-cache-rebuild
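To make that concrete, here is a minimal sketch of the two shapes under discussion, assuming illustrative table and column names (the real TradeDangerous schema differs in detail):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Single-table shape: one row per station/item pair, with partial
# "x_price > 0" indexes that must be maintained on every price update.
con.executescript("""
CREATE TABLE StationItem (
    station_id   INTEGER NOT NULL,
    item_id      INTEGER NOT NULL,
    demand_price INTEGER NOT NULL DEFAULT 0,
    supply_price INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (station_id, item_id)
);
CREATE INDEX si_demand ON StationItem (item_id) WHERE demand_price > 0;
CREATE INDEX si_supply ON StationItem (item_id) WHERE supply_price > 0;
""")

# Split-table shape: a row only exists where a station actually buys or
# sells the item, so the price predicate (and its index upkeep) disappears.
con.executescript("""
CREATE TABLE StationBuying (
    station_id INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    price      INTEGER NOT NULL,
    PRIMARY KEY (station_id, item_id)
);
CREATE TABLE StationSelling (
    station_id INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    price      INTEGER NOT NULL,
    PRIMARY KEY (station_id, item_id)
);
""")
```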
-
When I was very much younger, I studied chemistry at university and also took a lot of illicit mind-altering drugs. Either way, "ACID storage" means something unrelated to computers to me; could you educate me?
The utterly ridiculous database size is a large (if not the exclusive) reason we've started routinely purging data more than 30 days old. To put some numbers to this: before the purge the DB was ~5.5GB in size; it's now ~500MB. So the vast majority of the data available (for example from spansh) is wildly out of date, sometimes literally years old. This should also hugely reduce the daily listings.csv download size, which has been mentioned elsewhere.
-
Atomic, Consistent, Isolated, Durable: if you're in the middle of writing something to the disk, it can't end up in an unrecoverable state.

Say the first 1KB of your file is an index: four bytes of an ID and four bytes of an offset into the file where the data for that ID is at, e.g. (3612, 8000) - meaning the offset for id 3612 is end-of-index + 8000 bytes. You get an update that 3612 was deleted and 3613 added, but the new record is bigger so it has to be relocated, and it turns out there's a big enough free space at offset 2000. Writing this data into your file is going to take a non-trivial quantity of time - dozens or even hundreds of cpu instructions. Say the code does: seek(index_position) and then loses power partway through the remaining steps (the full sequence is sketched below).

There's no data at the new offset, you didn't move the old data into the free list, and the old data is still there but now orphaned. If those values are something to do with payments, you just threw money away.

There are ways you can work around it to provide stability, a common one being that you write the steps you need to execute to a "journal", and you don't allow access to the data while the journal is non-empty. After that you still have to be careful not to put the data into an intractable state, but you can at least recover from crashes or power outages. https://en.wikipedia.org/wiki/ACID

Unfortunately, as I was building up TD, I kept finding SQLite to not be all that; it was fairly routine for it to render the database unrecoverable. It was still giving me headaches at work up until about 4-5 years ago, but - touch wood - I haven't seen one in at least 4 years.
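A minimal sketch of that failure mode and the journaled workaround, in Python, using a made-up single-file format (the index layout, offsets, and function names are purely illustrative - this is not TD or SQLite code):

```python
import struct

# Index entries as described above: a 4-byte id followed by a 4-byte offset.
RECORD = struct.Struct("<ii")

def relocate_record(f, index_pos, record_id, new_offset, payload):
    """Naive, non-ACID update: several separate writes, no journal."""
    f.seek(index_pos)
    f.write(RECORD.pack(record_id, new_offset))  # index now points at new_offset...
    # ...a crash or power cut HERE leaves the file broken: nothing is at the
    # new offset yet, the old record is orphaned, the free list never updated.
    f.seek(new_offset)
    f.write(payload)

def relocate_with_journal(f, journal, index_pos, record_id, new_offset, payload):
    """Journaled variant: record the intent first, clear it only when done."""
    journal.seek(0)
    journal.write(b"%d %d\n" % (record_id, new_offset))
    journal.flush()        # readers stay out while the journal is non-empty
    relocate_record(f, index_pos, record_id, new_offset, payload)
    journal.seek(0)
    journal.truncate()     # an empty journal means the data is consistent again
```

If the process dies between the journal write and the truncate, recovery just replays the journal entry before allowing access again.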
-
I didn't even notice that
I did notice that, and you've mentioned it somewhat elsewhere IIRC. Honestly, it may make sense at this point to just migrate to a pure NoSQL key-store instead of using SQL. Though I'm honestly surprised that sqlite is so inefficient as to not be feasible to offload to, for concurrent lookup at least, given Python's inability to do threading and poor multi-process data-passing ability (pre 3.13 at least, if that ever materialises). That is, for all the areas where additional pure-data checks are being done in Python rather than SQL (like "is the price less than this number?" or "is the pad size..." etc.).
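Purely for illustration, pushing one of those pure-data checks down into SQL might look roughly like this (the column names, the numeric pad-size encoding, and the database path are all assumptions, not the actual TradeDangerous schema):

```python
import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")  # assumed path

# Let SQLite filter on price and pad size instead of fetching every row
# and re-checking the same conditions in Python.
query = """
SELECT si.station_id, si.item_id, si.supply_price
  FROM StationItem AS si
  JOIN Station     AS stn ON stn.station_id = si.station_id
 WHERE si.supply_price > 0
   AND si.supply_price <= :max_price
   AND stn.max_pad_size >= :min_pad
"""
rows = con.execute(query, {"max_price": 1000, "min_pad": 2}).fetchall()
```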
-
I'm a sysadmin by trade, not a programmer. I'll hack crappy code together to achieve something simple when I've no other choice, so a lot of this goes over my head. My question therefore is: how much of these discussions is realistically achievable? I've done a lot of project management, and this project has zero budget and very limited resources, so I look at this and, whilst it may be an academically interesting discussion, does it lead to something that can be made to happen with reasonable expectations on time input for all concerned?
-
This was, and still is, an actual issue. The discussion parts came after, but I'm pretty sure

Anyway, since it is a discussion now, I guess I'll use this to ask: is the sqlite database getting

To the issue itself, I have the following SQL query, tested locally:
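A sketch of that lookup (not the original query verbatim); it assumes System and Station tables with name, system_id and station_id columns, which is only an approximation of the real schema, and a database path that will differ per install:

```python
import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")  # assumed path

# Station names are not unique ("Walker Port" exists in many systems), so
# gather every station with the given name, then keep only the one whose
# parent system carries the (unique) system name.
query = """
WITH sys AS (
    SELECT system_id FROM System WHERE name = :system_name
),
stations AS (
    SELECT station_id, system_id FROM Station WHERE name = :station_name
)
SELECT station_id
  FROM stations
 WHERE system_id IN (SELECT system_id FROM sys)
"""
rows = con.execute(query, {
    "system_name": "...",           # whichever system you actually want
    "station_name": "Walker Port",  # deliberately ambiguous name, per below
}).fetchall()
print(rows)
```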
Obviously, the system and station name specifications need to be replaced with variable entries, but I chose that one since there are a ton of "Walker Port" stations, so it's a nice way to test the non-uniqueness criterion. I have it set up this way because I saw there was an index for "find a station by name" (non-unique) but no index set up for "find a system by name" (unique), so the best way I could think of was "find all stations with that name, then filter to stations whose system id matches the system name".

Incidentally, is there a reason there's an index to find stations by name but not systems by name? It seems like it would make more sense to filter by system name first, since it's unique, and then filter stations on the system_id, since there's an index for that too. Maybe I'm misunderstanding the way the DB is set up or some use case there, or it has to do with how foreign keys work (I'm not super familiar with what optimisations SQL can do with those constraints; if they link it directly then it makes sense that finding a station by name would let you immediately check its system name too).

Unfortunately, I couldn't quite figure out a clean way to get it to return two system-station pairs' data in one query without running into uniqueness issues while avoiding JOIN. Obviously could do

Here's a solution relying on JOIN (that also drags in the item stats rather than assuming we have that loaded in Python already to query against):
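Again only as a sketch of what such a JOIN-based query might look like (StationItem and its demand_price/supply_price columns are assumptions about the schema, and the system/station names are placeholders):

```python
import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")  # assumed path

# Resolve both endpoints by their unique system name plus station name,
# and pull the per-item stats along in the same query via StationItem.
query = """
SELECT sys.name, stn.name, si.item_id, si.demand_price, si.supply_price
  FROM System      AS sys
  JOIN Station     AS stn ON stn.system_id = sys.system_id
  JOIN StationItem AS si  ON si.station_id = stn.station_id
 WHERE (sys.name = :sys_a AND stn.name = :stn_a)
    OR (sys.name = :sys_b AND stn.name = :stn_b)
"""
rows = con.execute(query, {
    "sys_a": "...", "stn_a": "...",  # first system/station pair
    "sys_b": "...", "stn_b": "...",  # second system/station pair
}).fetchall()
```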
It performs basically within the same standard deviation as the single query using CTEs above, so it's probably fine. I remembered some vague tribal knowledge of "JOIN with way too large tables can lead to massive inefficiencies", but I'm guessing that either was never true or got optimised better sometime in the last, oh, 10-20 years since I last stared at SQL... The LHS 1348 system has a Haxel Terminal too; the above query properly avoids that (as it should).

I'll see about prototyping that into an actual python script and measuring its performance; for sure it'll beat the current implementation of

That actual prototype though will likely be either later this weekend or when I get time later, as I attempt to actually still have fun playing the game rather than doing tools for the game.
-
Calling
trade.py source destination
appears to attempt to load the entire database and then perhaps do some additional processing, considering the amount of time it takes. This is obviously wasteful: it has been given two exact start and end zones, so all it really needs to do is pull out the data for both of those locations and then compare them. And if pulling out the data requires loading the entire database rather than searching the database with much lower memory requirements... then that sounds like a pretty major usability issue considering how large the database is now.
I've noticed this for all the commands really; they seem to spend tons of time loading the entire database and then doing whatever culling operations, if any, are requested of the operation, making all queries slow. The load could be avoided by just making the application a REPL, something I believe has been mentioned in previous issues, or some sort of (local) server-client architecture, though making cases that obviously don't need a full load not do the full load would be a good step too.
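To illustrate the REPL idea only (nothing here is existing TradeDangerous API; load_database and run_command are hypothetical placeholders for whatever the real entry points would be):

```python
import shlex

def load_database():
    """Hypothetical: the expensive one-time load of systems/stations/prices."""
    ...

def run_command(db, argv):
    """Hypothetical: dispatch one trade.py-style command against loaded data."""
    ...

def main():
    db = load_database()  # pay the multi-gigabyte load cost exactly once
    while True:
        try:
            line = input("td> ")
        except EOFError:
            break
        if line.strip() in ("quit", "exit"):
            break
        run_command(db, shlex.split(line))  # e.g. "trade SOURCE DESTINATION"

if __name__ == "__main__":
    main()
```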