I’m back on CrimsonCache for a while. It dawned on me that it’s worth laying out why I’m building a synthetic dataset rather than finding a real one, since I have a strong preference for real data. It comes down to four reasons.
Overcomes data scarcity. I don’t have access to this sort of data. I’m not sure anyone other than blood banks does, and they don’t seem keen to make it available at this point.
Eliminates data privacy concerns. Actual blood bank data would carry a risk of a HIPAA violation. I think that risk could easily be mitigated, but synthetic data eliminates it completely.
Can be customized to develop specific ML models. Synthetic data can be tailored to test specific conditions, such as a donor population aging out or the effect of a mass donation event.
Can be used for multiple purposes. I originally intended this to be just for SQL practice. But with just a little more work it can be used for data engineering, data analysis, dashboarding, and statistical modeling.
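To make the third reason concrete, here is a minimal sketch of how a mass donation event could be dialed into generated data. This is not CrimsonCache's actual generator: the function name, its parameters, and every number in it are made up for illustration, and the noise is a rough Gaussian stand-in rather than anything fitted to real donation patterns.

```python
import random
from datetime import date, timedelta


def simulate_daily_donations(start, days, base_rate=40.0,
                             event_day=None, event_multiplier=3.0, seed=0):
    """Return a list of (date, donation_count) rows.

    Hypothetical helper: base_rate is the assumed average number of
    donations per day, and event_multiplier scales that average on
    event_day to mimic a mass donation event.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        expected = base_rate * (event_multiplier if day == event_day else 1.0)
        # Gaussian noise around the expected count, clipped at zero
        count = max(0, round(rng.gauss(expected, expected ** 0.5)))
        rows.append((day, count))
    return rows


# Two weeks of data with a donation drive on the eighth day
rows = simulate_daily_donations(
    start=date(2024, 1, 1),
    days=14,
    event_day=date(2024, 1, 8),
)
for day, count in rows:
    print(day.isoformat(), count)
```

The same idea extends to the aging-out scenario: instead of scaling one day's rate, you would shrink the pool of eligible donors over time. With real data I'd have to wait for conditions like these to happen and hope someone recorded them.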