Sampling Strategy

This blog post covers the four sampling strategies available in ThingWorx Analytics.  For each one, it explains how the strategy works behind the scenes, when you may want to use it, and its pros and cons.

SAMPLE_WITH_REPLACEMENT

This strategy is not often used by professionals, but it can still be useful in certain circumstances.  When you sample with replacement, each record that you randomly select is returned to the sample pool before the next selection.  So there is a chance that the same record appears multiple times in your sample.

Example

Let’s say you have a hat that contains 3 cards with different people’s names on them.

  • John
  • Sarah
  • Tom

Let’s say you make 2 random selections.  On the first selection, you pull out the name Tom.  When you sample with replacement, you put the name Tom back into the hat and then randomly select a card again.  On your second selection, you could get a different name, like Sarah, or the same name you selected the first time, Tom.
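The hat example can be sketched in a few lines of Python.  This only illustrates the general idea and is not how ThingWorx Analytics implements the strategy; the function name sample_with_replacement and the seed parameter are hypothetical.

```python
import random

names = ["John", "Sarah", "Tom"]

def sample_with_replacement(pool, k, seed=None):
    """Draw k items; each drawn item goes back into the pool before the next draw."""
    rng = random.Random(seed)  # a seed makes the draws repeatable
    return [rng.choice(pool) for _ in range(k)]

picks = sample_with_replacement(names, 2, seed=0)
print(picks)  # the same name can show up twice
```

Because every draw sees the full hat, you can even ask for more selections than there are cards; the extra selections are necessarily repeats.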

Pros

  • May produce better models for small datasets with low row counts

Cons

  • The accuracy of the model may be artificially inflated due to duplicates in the sample

SAMPLE_WITHOUT_REPLACEMENT

This is the default setting in ThingWorx Analytics and the sampling strategy most commonly used by professionals.  With this strategy, a value that is randomly selected from the sample pool is not returned to it.  This ensures that all the values selected for the sample are unique.

Example

Let’s say you have a hat that contains 3 cards with different people’s names on them.

  • John
  • Sarah
  • Tom

Let’s say you make 2 random selections.  On the first selection, you pull out the name Tom.  When you sample without replacement, you randomly select a card from the hat again without putting the Tom card back.  On your second selection, you could only get the Sarah or John card.
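The same hat, this time without putting cards back, might look like the sketch below.  Again, this is only an illustration in plain Python, not the ThingWorx Analytics implementation; sample_without_replacement is a hypothetical helper built on the standard library's random.sample.

```python
import random

names = ["John", "Sarah", "Tom"]

def sample_without_replacement(pool, k, seed=None):
    """Draw k distinct items; a drawn item is never returned to the pool."""
    rng = random.Random(seed)
    # random.sample raises ValueError if k is larger than the pool
    return rng.sample(pool, k)

picks = sample_without_replacement(names, 2, seed=0)
print(picks)  # two different names, never the same card twice
```

Note that here you cannot select more cards than the hat contains, which is the practical difference from sampling with replacement.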

Pros

  • This is the sampling strategy that is most commonly used
  • It will deliver the best results in most cases

Cons

  • May not be the best choice if the desired goal outcome is underrepresented in the dataset

UPSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT

This is useful when the desired goal outcome is underrepresented in the dataset.  The records that represent the desired goal outcome are copied multiple times so that they represent a larger share of the total dataset.

Example

Let’s say you are trying to predict whether a patient is at risk of developing a rare condition, like chronic kidney failure, that affects around 0.5% of the US population.  In this case, the most accurate model might simply predict that no one will get the condition, and according to the numbers, it would be right 99.5% of the time.  But in reality, this is not helpful at all for the use case, since you want to know whether the patient is at risk of developing the condition.

To prevent this, copies are made of the records where the patient did develop the condition so that they represent a larger share of the dataset.  Doing this gives ThingWorx Analytics more examples to help it generate a more useful model.
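To make the numbers concrete, here is a rough Python sketch using made-up data: 1,000 hypothetical patients, 5 of whom developed the condition.  The upsample helper, the developed_condition field, and the choice of 20 copies are all assumptions for illustration; how ThingWorx Analytics chooses the number of copies internally is not covered here.

```python
def upsample(records, is_goal, copies):
    """Return a new list in which every goal record appears `copies` times."""
    out = []
    for rec in records:
        out.extend([rec] * (copies if is_goal(rec) else 1))
    return out

def goal_share(records):
    """Fraction of records where the patient developed the condition."""
    return sum(r["developed_condition"] for r in records) / len(records)

# Made-up dataset: 1,000 patients, 5 of whom (0.5%) developed the condition.
patients = [{"id": i, "developed_condition": i < 5} for i in range(1000)]

# A model that always predicts "no condition" is right for 995 of 1,000 rows.
baseline_accuracy = sum(not r["developed_condition"] for r in patients) / len(patients)

upsampled = upsample(patients, lambda r: r["developed_condition"], copies=20)

print(baseline_accuracy)       # 0.995
print(goal_share(patients))    # 0.005
print(len(upsampled))          # 1095 rows: 995 originals + 5 * 20 copies
print(goal_share(upsampled))   # roughly 0.09
```

No original record is discarded, but the dataset grows from 1,000 to 1,095 rows, which is where the extra training time comes from.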

Pros

  • Patterns from the original dataset remain intact

Cons

  • Longer training time, because the copied records make the dataset larger

DOWNSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT

This is also useful when the desired goal outcome is underrepresented in the dataset.  In downsample and sample without replacement, some records that do not represent the desired goal outcome are removed.  This is done to increase the percentage of the dataset that the desired records occupy.

Example

Let’s continue using the medical example from above.  Instead of creating copies of the desired records, undesired records are removed from the dataset.  This causes the records where patients did develop the condition to occupy a larger percentage of the dataset.
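Using the same made-up dataset of 1,000 hypothetical patients, downsampling could be sketched as follows.  The downsample helper, the developed_condition field, and the choice to keep 95 undesired records are assumptions for illustration only, not the ThingWorx Analytics implementation.

```python
import random

def downsample(records, is_goal, keep_non_goal, seed=None):
    """Keep every goal record plus a random subset of the other records."""
    rng = random.Random(seed)
    goal = [r for r in records if is_goal(r)]
    other = [r for r in records if not is_goal(r)]
    # Drop non-goal records at random, without replacement.
    kept = rng.sample(other, min(keep_non_goal, len(other)))
    return goal + kept

# Made-up dataset: 1,000 patients, 5 of whom (0.5%) developed the condition.
patients = [{"id": i, "developed_condition": i < 5} for i in range(1000)]

downsampled = downsample(
    patients, lambda r: r["developed_condition"], keep_non_goal=95, seed=0
)
share = sum(r["developed_condition"] for r in downsampled) / len(downsampled)

print(len(downsampled))  # 100 rows: 5 desired + 95 undesired
print(share)             # 0.05
```

The dataset shrinks from 1,000 rows to 100, so training is faster, but the 900 discarded records may have contained patterns that the model now never sees.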

Pros

  • Shorter training time

Cons

  • Patterns from the original dataset may be lost