Overview on Sampling Strategy


Sampling Strategy

This blog post covers the four sampling strategies available in ThingWorx Analytics. For each one, it explains how the strategy works behind the scenes, when you may want to use it, and what its pros and cons are.

SAMPLE_WITH_REPLACEMENT

This strategy is not often used by professionals, but it can still be useful in certain circumstances. When you sample with replacement, each randomly selected value is returned to the sample pool before the next draw, so the same record can appear multiple times in your sample.

Example

Let’s say you have a hat that contains 3 cards, each with a different person’s name on it.

John
Sarah
Tom

Let’s say you make 2 random selections. On the first selection, you pull out the name Tom. When you sample with replacement, you put the Tom card back into the hat before randomly selecting a card again. On your second selection, you could get a different name, like Sarah, or the same name you already selected, Tom.
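The hat example can be sketched in a few lines of Python. This is only an illustration built on the standard library's `random.choices`, not how ThingWorx Analytics implements the strategy internally; the function name `sample_with_replacement` and the `seed` parameter are made up for the example.

```python
import random

def sample_with_replacement(pool, k, seed=None):
    """Draw k items, returning each drawn item to the pool before the next draw."""
    rng = random.Random(seed)
    # random.choices draws with replacement, so the same item can come up again.
    return rng.choices(pool, k=k)

hat = ["John", "Sarah", "Tom"]
picks = sample_with_replacement(hat, k=2, seed=0)
print(picks)  # two names from the hat; the same name may appear twice
```

Because every card goes back into the hat, you can even draw more cards than the hat contains, in which case duplicates are guaranteed.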

Pros

May find improved models in smaller datasets with low row counts

Cons

The accuracy of the model may be artificially inflated due to duplicates in the sample

SAMPLE_WITHOUT_REPLACEMENT

This is the default setting in ThingWorx Analytics and the sampling strategy most commonly used by professionals. After a value is randomly selected from the sample pool, it is not returned. This ensures that all the values selected for the sample are unique.

Example

Let’s say you have a hat that contains 3 cards, each with a different person’s name on it.

John
Sarah
Tom

Let’s say you make 2 random selections. On the first selection, you pull out the name Tom. When you sample without replacement, you leave the Tom card out of the hat and randomly select again from the remaining cards. On your second selection, you could only get Sarah or John.
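The same hat can be sampled without replacement using the standard library's `random.sample`. As before, this is a rough sketch of the idea rather than the ThingWorx Analytics implementation, and the function name `sample_without_replacement` is hypothetical.

```python
import random

def sample_without_replacement(pool, k, seed=None):
    """Draw k items; a drawn item is never returned to the pool."""
    rng = random.Random(seed)
    # random.sample never picks the same position twice,
    # so k cannot exceed the size of the pool.
    return rng.sample(pool, k)

hat = ["John", "Sarah", "Tom"]
picks = sample_without_replacement(hat, k=2, seed=0)
print(picks)  # two different names from the hat
```

Asking for more cards than the hat holds (for example `k=4`) raises a `ValueError`, which mirrors the fact that the hat simply runs out of cards.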

Pros

This is the sampling strategy that is most commonly used
It will deliver the best results in most cases

Cons

May not be the best choice if the desired goal is underrepresented in the dataset

UPSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT

This is useful when the desired goal outcome is underrepresented in the dataset. The records that represent the desired outcome of the goal are copied multiple times so that they make up a larger share of the total dataset.

Example

Let’s say you are trying to discover whether a patient is at risk of developing a rare condition, like chronic kidney failure, that affects around 0.5% of the US population. In this case, the most accurate model generated would likely say that no one will get this condition, and according to the numbers, it would be right 99.5% of the time. In reality, this is not helpful for the use case at all, since you want to know whether the patient is at risk of developing the condition.

To prevent this, copies are made of the records where the patient did develop the condition, so that they represent a larger share of the dataset. This gives ThingWorx Analytics more examples to help it generate a more accurate model.
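A minimal sketch of the upsampling step, using made-up toy data: 5 of 1,000 patients (0.5%) developed the condition. The `upsample` helper, the `developed_condition` field, and the choice of 20 copies are all assumptions for illustration; ThingWorx Analytics decides internally how many copies to make, and it then samples without replacement from the result.

```python
def upsample(records, is_minority, copies):
    """Return the records with each minority record appearing `copies` times in total."""
    out = []
    for record in records:
        # Majority records are kept once; minority records are repeated.
        out.extend([record] * (copies if is_minority(record) else 1))
    return out

# Hypothetical toy dataset: patients 0-4 developed the condition.
records = [{"id": i, "developed_condition": i < 5} for i in range(1000)]

upsampled = upsample(records, lambda r: r["developed_condition"], copies=20)
positives = sum(r["developed_condition"] for r in upsampled)
share = positives / len(upsampled)
print(len(upsampled), positives, f"{share:.1%}")  # 1095 100 9.1%
```

No original record is dropped, which is why the patterns in the dataset stay intact, but the dataset grows, which is why training takes longer.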

Pros

Patterns from the original dataset remain intact

Cons

Longer training time

DOWNSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT

This is also useful when the desired goal outcome is underrepresented in the dataset. With this strategy, some records that do not represent the desired goal outcome are removed, which increases the percentage of the dataset made up of records with the desired outcome.

Example

Let’s continue with the medical example from above. Instead of creating copies of the desired records, undesired records are removed from the dataset. As a result, the records where patients did develop the condition occupy a larger percentage of the dataset.
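The downsampling step can be sketched the same way, again with made-up toy data in which 5 of 1,000 patients developed the condition. The `downsample` helper, the `developed_condition` field, and the choice to keep 95 majority records are assumptions for illustration only, not the ThingWorx Analytics implementation.

```python
import random

def downsample(records, is_minority, n_majority_to_keep, seed=None):
    """Keep every minority record and a random subset of the majority records."""
    rng = random.Random(seed)
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    # The majority records that are not picked here are discarded.
    kept = rng.sample(majority, n_majority_to_keep)
    return minority + kept

# Hypothetical toy dataset: patients 0-4 developed the condition.
records = [{"id": i, "developed_condition": i < 5} for i in range(1000)]

downsampled = downsample(
    records, lambda r: r["developed_condition"], n_majority_to_keep=95, seed=0
)
positives = sum(r["developed_condition"] for r in downsampled)
share = positives / len(downsampled)
print(len(downsampled), positives, f"{share:.1%}")  # 100 5 5.0%
```

The dataset shrinks from 1,000 to 100 rows, which shortens training, but 900 majority records and whatever patterns they carried are gone.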

Pros

Shorter training time

Cons

Patterns from the original dataset may be lost

Rocko · ‎Sep 12, 2017

Hi John Greiner, it is also possible to select "None" as sampling value. When would this be applied and what would be the effect?

BR

Roman

jgreiner · ‎Sep 12, 2017

Hi Roman,

Yes, None is an option. It causes the model to be trained on the entire dataset instead of a sample. It is only recommended for smaller datasets (only a few thousand rows), because applying it to a larger dataset adds a significant amount of time to the training process.

Warm Regards,

John
