omitted charge values will be my downfall. I think I really will need to parse the [ atomtypes ] forcefield parameter definitions...
okay, starting, just wrote the
// This is where this gets extremely cursed.
line so now we can start.
so, yes, the temptation to actually just load everything into a large-ass string in memory is growing.
or rather, I could still make it Nice, by storing a Vec of rich Spans, which also contain file and line number information.
this has another great advantage: I could do some cycle detection a bit more easily.
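a rough sketch of what such a span list could look like, purely as an illustration. the names `Span`, `spans_from_str` and `include_cycle` are made up here and are not from the actual code; the only format facts assumed are that `;` starts a comment in GROMACS topology files and that includes can nest. cycle detection then comes down to checking whether a file is already on the stack of files currently being included.

```rust
use std::path::{Path, PathBuf};

/// One meaningful line of topology text plus where it came from.
/// (Hypothetical layout, not the real bentopy type.)
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub file: PathBuf,
    /// 1-based line number within `file`.
    pub line: usize,
    pub text: String,
}

/// Turn the contents of one file into spans, dropping `;` comments and
/// lines that are empty after trimming.
pub fn spans_from_str(file: &Path, contents: &str) -> Vec<Span> {
    contents
        .lines()
        .enumerate()
        .filter_map(|(i, raw)| {
            // Everything after the first `;` is a comment.
            let text = raw.split(';').next().unwrap_or("").trim();
            if text.is_empty() {
                None
            } else {
                Some(Span {
                    file: file.to_path_buf(),
                    line: i + 1,
                    text: text.to_string(),
                })
            }
        })
        .collect()
}

/// If `next` is already on the include stack, return the chain of files
/// that forms the cycle (ending with `next` again); otherwise `None`.
pub fn include_cycle(stack: &[PathBuf], next: &Path) -> Option<Vec<PathBuf>> {
    let start = stack.iter().position(|p| p == next)?;
    let mut cycle = stack[start..].to_vec();
    cycle.push(next.to_path_buf());
    Some(cycle)
}
```

because every span carries its file and line, an error message for a bad line or an include cycle can point at the exact place it came from.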
I'm quite convinced I can get away with it because the largest itp file I know of contains an entire chromosome (coarse-grained but still) with close to a million particles, and it is "only" 330 MiB.
that is by far the largest single structure we deal with. even if it's 10× larger, it would still be very possible.
@ma3ke if it can fit in memory i would say do it. the OS does lots of paging to make it work, no reason not to take advantage of that
@ma3ke linux has transparent hugepages that may be interesting
@hipsterelectron it's not even a very performance-sensitive application, since the work is simple and scales linearly with the total number of lines across the files the program needs to go through.
the currently working alternative is to just run GROMACS and see if it complains about a non-zero total system charge for the molecules in the topology, and then you take that number and pass it to the --charge flag for bentopy-solvate.
instead, we want to be able to say --charge neutralize or smth and it figures it out _just like GROMACS does_ but quickly and in-situ.
@ma3ke i know for rapid string search that using i/o like read()/write() is much faster than mmap() but if you're doing anything more complex than SIMD literal search then mmap() may be more competitive and the simplicity of a linear array sounds very nice
moreover, I can actually be very smart about it, and only store spans that will actually be useful to me. or rather, meta spans that contain, for instance, the atom lines from the [ atoms ] section, rather than all of the other directives as well. then the memory footprint would be relatively modest.
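to make the "only keep the atom lines" idea concrete, here is a minimal sketch of getting a charge out of stored [ atoms ] lines. it assumes the usual column order (nr, type, resnr, residue, atom, cgnr, charge, mass) with charge and mass optional, and it assumes the per-type default charges have already been pulled out of [ atomtypes ] into a map. `atom_charge`, `molecule_charge` and `type_charges` are hypothetical names, and comments are assumed to be stripped already.

```rust
use std::collections::HashMap;

/// Charge of a single `[ atoms ]` line.
/// Uses the explicit charge column when present; otherwise falls back to
/// the default charge for the atom type. Returns `None` if the line is
/// malformed or the type is unknown.
pub fn atom_charge(line: &str, type_charges: &HashMap<String, f64>) -> Option<f64> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    match fields.get(6) {
        // Column 7 (index 6) holds the explicit charge.
        Some(q) => q.parse().ok(),
        // Charge omitted: look up the atom type from column 2 (index 1).
        None => type_charges.get(*fields.get(1)?).copied(),
    }
}

/// Total charge over the atom lines of one molecule.
/// Returns `None` as soon as any line cannot be resolved.
pub fn molecule_charge<'a>(
    lines: impl IntoIterator<Item = &'a str>,
    type_charges: &HashMap<String, f64>,
) -> Option<f64> {
    lines
        .into_iter()
        .map(|line| atom_charge(line, type_charges))
        .sum()
}
```

the total system charge would then be each molecule's charge multiplied by its count from [ molecules ], summed over all entries.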
point is, I think that trying to do this all in a streaming approach is just way more work for relatively little advantage.
one of my goals with implementing this is that the implementation is easy to follow, in terms of its ordering and semantics. so that's another advantage.
ALSO, if need be, I can parallelize the charge summing part very nicely, especially when it's all laid out in a read-only in-memory list.
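for what it's worth, a sketch of how the summing could be split over threads using only the standard library (`std::thread::scope`, available since Rust 1.63). it assumes the charges are already resolved into a flat read-only slice; `parallel_charge_sum` is a made-up name. note that summing floats in chunks can differ from a sequential sum in the last bits, which might matter when checking for an exactly neutral system.

```rust
use std::thread;

/// Sum `charges` by splitting the slice into roughly equal chunks and
/// summing each chunk on its own scoped thread.
pub fn parallel_charge_sum(charges: &[f64], n_threads: usize) -> f64 {
    let n_threads = n_threads.max(1);
    // Ceiling division; `chunks` panics on a chunk size of zero,
    // so keep it at least 1 for the empty-slice case.
    let chunk_size = ((charges.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        // Scoped threads may borrow `charges` directly, no copying needed.
        let handles: Vec<_> = charges
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("summing thread panicked"))
            .sum::<f64>()
    })
}
```

given that the whole job is linear in the number of lines, this is probably only worth it for the very largest systems.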