This post was quickly written, as it's for a narrow audience and might get outdated. We're happy to provide more rigor and context if requested. Let us know if you are interested in taking any of them on and could use some guidance!
For those not themselves interested in contributing, this might be useful for giving people a better idea of the sorts of challenges we at QURI work on.
1. Density Estimation
Users often want to convert samples into continuous probability distribution functions (PDFs). This is difficult to do automatically.
The standard approach of basic Kernel Density Estimation can produce poor fits on multimodal or heavily skewed data.
a. Variable kernel density estimation
Simple KDE algorithms use a constant bandwidth. There are multiple methods for estimating this. One common method is Silverman's rule of thumb. In practice, using Silverman’s rule of thumb with one single bandwidth performs poorly for multimodal or heavily skewed distributions.
Squiggle performs log KDE for heavily skewed distributions, but this only helps so much, and this strategy comes with various inconsistencies.
b. Performant KDE with non-triangle shapes
Squiggle now uses a triangle kernel for speed. Fast algorithms (FFT) should be possible, with better kernel shapes.
See this thread for some more discussion.
c. Cutoff Heuristics
One frequent edge-case is that many distributions have specific limits, often at 0. There might be useful heuristics like, “If there are no samples below zero, then it’s very likely there should be zero probability mass below zero, even if many samples are close and the used bandwidth would imply otherwise.”
See this issue for more information.
d. Discrete vs. continuous estimation
Sometimes, users pass in samples from discrete distributions or mixtures of discrete and continuous distributions. In these cases, it’s helpful to have heuristics to detect which data might be meant to be discrete and which is meant to be continuous. Right now, in Squiggle, we do this by using simple heuristics of repetition - if multiple samples are precisely the same, we assume they represent discrete information. It’s unclear if there are any great/better ways of doing this heuristically.
e. Multidimensional KDE
Eventually, it will be useful to do multidimensional KDE. It might be more effective to do this in WebAssembly, but this would of course, introduce complications.
2. Quantiles to Distributions, Maybe with Metalog
A frequent use case is: “I have a few quantile/CDF points in mind and want to fit this to a distribution. How should I do this?”
Jonas Moss thinks we can do better than Metalog for most use cases. He wrote the post, Deriving distributions from quantiles, about other potential techniques.
Fitting for Metaculus
Metaculus uses mixtures of logistic distributions as their input method. It would be useful to convert quantile points into this format.
3. “Point Set” Formats
One of the main features of Squiggle is its support for what we call point set distribution formats. This means these distributions are stored as x-y coordinate pairs instead of symbolic representations or lists of random samples. This is useful for several operations (scoring and aggregation, for example) and for visualization. See more discussion here.
Some properties that would be useful to have include:
a. Interpolation Methods
Squiggle currently uses linear interpolation between continuous XY points to approximate distribution shapes. It would be of course, better to do something with curves. However, it’s important to maintain the key properties of probability distributions (no negative probability mass). Ideally, a good approximation could handle varied distributions, from Gaussian to Uniform distributions. It’s possible that splines or similar will be useful here.
b. Continuous + Discrete Mixtures
Handle combinations of continuous and discrete data. Store each with X-Y values. (Squiggle currently does this.)
c. Arbitrary Discrete Distributions
Storing discrete values with X-Y pairs doesn’t work when there’s a very large set of potential X values. For example, a discrete distribution for “What will the US population be in 2070?” might require tens of millions of entries - one for each possibility of the discrete number of people. For these cases, ideally, there could be some limited functional form.
d. Function Tails
Distribution tails are important to approximate but might require specialized methods other than xy point interpolation. The tails of distributions might be best approximated as simple functions. Thus, these formats would include both lists of X-Y points plus additional information about the tail shapes.
e. Domain Limits
This solution would be a bit tricky to make fully functional, but it might be a decent enough hack for simple, improved tails.
f. Multivariable Shapes
The point set formats described above apply to single distributions, but we also want similar formats extending to multiple variables. One simple example would be a distribution that varies within a specific time range. If we could approximate these functions as shapes using coordinates, we could better score/aggregate/visualize them. Performance would be very important for this.
g. Key Mathematic Operations
h. More Thorough Ontologies
The point set shape above could sometimes be a fit for PDFs, sometimes for CDFs, and sometimes for other shapes. When users do certain operations on PDFs, they quickly become more generic functions, but we might want to still store them in a format like point sets. How to best categorize these various types of formatted functions is unclear.
It would be good to have a library that treats each various shape as a certain mathematical object with corresponding valid properties.
It’s not clear what categorization/organization system to apply. Perhaps there are clever ideas from mathematics or category theory to help here. Category theory brought a well-grounded ontology to functional programming; we might be able to use similar in this area.
4. Scoring Heuristics for General Point Set Distributions
Scoring on point set distributions can be difficult if estimators use different approximations. Some known challenges include:
a. Tails with Zero Probability
One frequent challenge we faced with previous distribution scoring is that many users would submit too narrow distributions. With point set formats, it was very easy to approximate the tails as having zero probability mass. This would quickly result in a score of negative infinity.
It would be good to figure out better ways of dealing with this. The cleanest is with a better tail format, as discussed above. But even this is likely not enough, especially for forecasters new to using distributions. We might well desire other approximations or heuristics here.
b. Discrete/Continuous Mixtures
It’s not clear how to best score discrete and continuous mixtures. Scoring algorithms typically apply to only discrete or only continuous data.
One particular complication is when some users submit only continuous forecasts. For example, say there’s a question, “How much will an upcoming Android phone cost?”. Some users might take the time to write a discrete distribution with probability mass at exactly “749.98”, and some might use a continuous distribution. One potential solution is to have the question writer specify specific x-values that can and should get discrete probability mass. Still, it could be nice to have heuristics for situations without this. Another potential solution would be to add some noise.
5. D3 Scale Axes
We’re using D3 to display Squiggle distributions. One challenge is that D3 scales are fairly limited. The Log scale doesn’t work for values below zero; this is problematic because many user-generated distributions are either negative or have negative and positive mass.
It would be useful to have better D3 scales for both situations.
6. Forced Distribution Correlations
There’s a common use case of wanting to force one distribution of samples to be manually correlated with another. However, there are many ways of doing this; it’s unclear what’s best. It probably makes sense to have several options eventually, but it would be good to have one or two robust recommended options.
Tom Adamczewski has recommended working with Copulas for this, here.
7. Language Model Training
As discussed here, there has not yet been any custom training of language models with Squiggle or other probabilistic programming languages in JS (that we know of). This could make for some neat projects and could be fairly straightforward.
8. Distribution-Specific Tasks
a. Beta Distribution Fitting
It would be useful to have a simple function that takes in two specific percentiles (the 5th and 95th are fine) and fits them to a Beta distribution. This likely requires some search method. Nuño Sempere made one attempt here but could use more testing and polish.
b. Extra Distributions
9. Bonus: Relative Value Functions
Relative value functions formats seem useful, but they probably could use a good amount more theory and experimentation.
Caching Value Ratios
Calculating the ratio distributions between any two items to compare is expensive. We want fast and efficient methods of caching and approximating these. The brute-force method calculates and stores every comparison - which would be ~N^2 comparisons. This becomes infeasible with over a few hundred items. It would be preferable to use heuristics to determine which specific comparisons are the most important to calculate and then use results to approximate other comparisons.