All compression functions are currently only running on the calling goroutine so only one core will be used per block.
The compressor is significantly faster if symbols are kept as small as possible. The highest byte value of the input is used to reduce some of the processing, so if all your input is above byte value 64 for instance, it may be beneficial to transpose all your input values down by 64.
With moderate block sizes around 64k speed are typically 200MB/s per core for compression and around 300MB/s decompression speed.
The same hardware typically does Huffman (deflate) encoding at 125MB/s and decompression at 100MB/s.
At one point, more internals will be exposed to facilitate more "expert" usage of the components.
A streaming interface is also likely to be implemented. Likely compatible with FSE stream format.
Contributions are always welcome. Be aware that adding public functions will require good justification and breaking changes will likely not be accepted. If in doubt open an issue before writing the PR.