Performance comparison
The following comparisons were observed using a basis set of 35 cosine functions,
an x grid of 301 points, and an example data set of 20,449 distributions.
All measurements were made on a desktop computer.
“Baseline” refers to a qp.Ensemble composed of qp.interp
(x,y interpolated) distributions.
“qp_flexzboost” refers to a qp.Ensemble composed of qp_flexzboost
distributions.
Measurement |
Baseline |
qp_flexzboost |
|---|---|---|
Storage size on disk |
48M |
2.9M (>16x smaller) |
Model estimation time |
17.4s +/- 0.3s |
16.8s +/- 0.1s |
Extract median values |
0.7s +/- 0.1s |
9.1s +/- 0.2s |
Plot median values |
59ms +/- 3ms |
59ms +/- 3ms |
Convert to |
0.04s +/- 0.01s |
8.1s +/- 0.1s |
Convert to |
Already |
10.6s +/- 0.1s |
Other considerations
While additional parameters that will affect storage size come into play (the size of the basis set, the resolution of the x grid provided, the number of CDEs stored) the measurements shown in the table above are derived from a reasonable set of parameters a user might select.
Upon further exploration, it appears that XGBoost, the modeling tool that
Flexcode uses, allows at most 38 basis functions.
Any basis functions requested beyond that will be ignored.
Thus for comparison above, the largest output file size for 20,449 distributions
will be approximately 3.3M for the qp_flexzboost representation.
To probe the storage requirements for the baseline, the number of distributions was held constant at 20,449, while varying the number of x values. As expected, decreasing the number of x grid points from 301 to 101 (roughly 1/3) reduced the storage by 1/3 to 16M. Additionally, reducing this x grid points to 20 produced output files 3.3M in size. It is feasible to reduce the size further with fewer x values, however the fidelity of the reconstructed distributions begins to suffer.
It is also important to note that qp_flexzboost represented distributions
allow the user to manipulate the post processing parameters without the need to
rerun the model - x,y interpolated representation do not permit that kind of
manipulation.
Additionally qp_flexzboost represented distributions are lossless, whereas
the x,y interpolated representations become more lossy as the storage size
approaches that of qp_flexzboost.
The x,y interpolated representation is significantly more performant when
operating on the data. One potential workflow may be to store the data in the
native qp_flexzboost representation, and then convert to an interpolated
grid when performing operations on it.
Converting the qp_flexzboost representation to x,y interpolated using with
a grid = np.linspace(0.0, 3.0, 301) (to match the baseline) took approximately
11 seconds.
Given these all of these considerations, it is up to the user to determine when to use each storage representation.
If compact storage is required, or if the ability to experiment with different
bump threshold or sharpening alpha parameters is needed, then using
qp_flexzboost makes sense.
Alternatively, if efficient storage is a lower priority compared to computation speed, the x,y interpolated representation is a better choice.
Finally, depending on the computation requirements, it may be practical to
store the data using the compact qp_flexzboost representation, then unpack
it into a x,y interpolated representation while working with it.
Additional context
Prior to the creation of qp_flexzboost, rail_flexzboost would store
Flexcode output as a qp.Ensemble of x,y interpolated distributions.
This approach is computationally efficient, but also lossy, and the fidelity of
the persisted data was proportional storage size on disk.
To make use of the x,y interpolated representations, the user must define
a grid of x values for which corresponding y values of the Flexcode
conditional density estimates would be stored.
As the x resolution increases the fidelity and storage requirements also increase.
Using qp_flexzboost, we store only the Flexcode output basis function
weights for each conditional density estimate.
This approach is lossless and there is minimal computational impact when
compared to the baseline.