Data Science Quick Reference Glossary

    • Advanced Analytics – A grouping of analytic techniques used to predict future outcomes.
    • AverageGoal – The average goal value for all of the records in a cluster or profile that match the specific values or conditions. Average goal is useful for both ordinal and Boolean goals.
    • Backpropagation – Backward propagation of errors (backprop), the technique Neuron uses for Training and Scoring operations.
    • Binning – The process of replacing real values with a nominal representation of the range of values (i.e., a bucket) that can be useful for correcting outliers or flattening the distribution of values. For example, the following set {0,1,2,3,4,5,10} could be binned into { [0-1], [2-3], [4-5], [10+] }.
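The binning example above can be sketched in a few lines of Python. This is only an illustration of the idea, not Neuron's implementation: the `BINS` table and the `bin_value` helper are hypothetical names, and returning `None` for values that fall between bins is an assumption.

```python
# Hypothetical bin table mirroring the glossary example: (label, low, high).
BINS = [
    ("[0-1]", 0, 1),
    ("[2-3]", 2, 3),
    ("[4-5]", 4, 5),
    ("[10+]", 10, float("inf")),
]


def bin_value(value):
    """Return the label of the bin whose inclusive range contains value."""
    for label, low, high in BINS:
        if low <= value <= high:
            return label
    return None  # assumption: values in a gap (e.g. 6 to 9) get no bin


values = [0, 1, 2, 3, 4, 5, 10]
binned = [bin_value(v) for v in values]
print(binned)
```

Because the last bin is open-ended, an extreme value such as 10000 lands in `[10+]` alongside 10, which is how binning can blunt the effect of outliers.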
    • Categories – An optional tag that identifies the groupings or categories to which a data element belongs.
    • Range – The acceptable range of values for the feature.
    • Cloud Computing – Using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server (locally hosted) or a personal computer.
    • Cluster(s) – A subset of data that shares mutual characteristics useful in determining commonalities in data.  By default, an investigation will create 2, 3, 5 and 7 clusters of information, and an investigation must have at least 1 cluster operation.
    • Cluster Scoring – Analysis which assigns a cluster to a previously unseen record. A model must first be created based on a sufficient sample of Training data; this model is then applied to one or more Scoring records, and a cluster is assigned to each.
    • ClusterID – The cluster or cluster path that the record is closest to. In a cluster model with a single level, it will be the unique ClusterID. In the case of a hierarchical cluster set, it will be the full path.
    • Correlation – Any of a broad class of statistical relationships involving dependence between two variables.
    • Data Configure (service) – Service to create and/or alter a dataset.
    • Data Mining – The computational process of discovering patterns in large data sets to extract information and transform it into an understandable structure for future use.
    • Data Normalization – The process by which Neuron internally transforms values into a decimal range between 0 and 1.  This process also pivots string values into columnar Booleans (see Nominal Expansion).
    • Data Science – The extraction of knowledge from large volumes of data leveraging the field of data mining and predictive analytics.
    • Data Submit (service) – Service to submit data into a configured dataset.
    • Data Type – The Neuron-specific datatype.  Can be the following:
        • String: An alphanumeric value conforming to the UTF-8 code page
        • Ordered String: An alphanumeric value conforming to the UTF-8 code page, presented as an ordered list of values. All string values must be represented in the list.  (See Ordinal)
        • Integer: A positive or negative whole number not containing a decimal point
        • Double/Numeric: A positive or negative number which may or may not contain a decimal point
        • Boolean: A value representing true/false
    • Dataset – A file to be used for Neuron consumption.  In self-service, a file is uploaded to be used for investigation. Through the API, a dataset is the container defined by the Data Configure service that holds data sent through the Data Submit service.
      • Note: This self-service file has to meet the following criteria:
        • It must be in CSV format (comma-separated ",").  Neuron cannot interpret quoted identifiers or commas in strings.
        • It must not have a header.
        • It cannot contain special characters.
        • The number of columns must match the number of columns specified in the Metadata file.
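Some of the file criteria above can be checked before upload. The sketch below is a hypothetical pre-flight check, not part of Neuron: `find_csv_problems` and its messages are invented for illustration, and it covers only quoted values and column counts, since the glossary does not define which characters count as special and a header row cannot be detected reliably.

```python
def find_csv_problems(lines, expected_columns):
    """Return (line_number, message) pairs for rows that fail the checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        if '"' in row:
            problems.append((number, "quoted value"))
        # A plain split is enough here because quoting is not allowed,
        # so every comma is a column separator.
        if len(row.split(",")) != expected_columns:
            problems.append((number, "wrong column count"))
    return problems


sample = ["1,red,3.5", '2,"blue, dark",4.0', "3,green"]
problems = find_csv_problems(sample, expected_columns=3)
for line_number, message in problems:
    print(line_number, message)
```

The second sample row shows why commas inside strings are a problem: the embedded comma is read as a separator, so the row appears to have four columns.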
    • Dependent Variable – The variable in an experiment that is being measured (See also: Topic, Goal, Objective).
    • Descriptors – Features within a dataset that cannot be easily or meaningfully altered within the context of an experiment.
    • Density – Measure (from 0-1) of how significant a specific feature is in differentiating a cluster from the other clusters in a set.
    • Display Only – Feature within a dataset that is used strictly for post-analysis reporting but is explicitly withheld from actual automated analysis.
    • DisplayOnly – Allows data to be used for internal Neuron query and data result purposes but not within the Neuron learning algorithms as an input.
    • Distance – The percent difference between the record scored and the average of a cluster.
    • Dumpnet – A web interrogation service used to query the serialized model object to determine the overall model RMSE as well as the RMSE of individual scoring bins and of the training and validation results.
    • Elite Average (Neuron Default) – A combination of Average and Best Ensemble. The ensemble chooses the Best/Elite learners based on how the learners performed in training. Each item in holdout is individually scored by each learner.
    • EnsembleTechnique – The ensemble learning technique used by Neuron to minimize prediction error.
    • ExclusionList – A list of any features to be excluded from the analysis job.
    • (Nominal) Expansion – The process of pivoting normalized data into Boolean values.  Expansion occurs after normalization, and expanded datasets are used for Backpropagation Operations (Training and Scoring).
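As a rough illustration of expansion, the sketch below pivots one nominal column into one Boolean column per distinct value. The `expand_nominal` helper, the alphabetical column order, and the dictionary output are assumptions made for the example; the glossary does not describe Neuron's internal layout.

```python
def expand_nominal(values):
    """Pivot a nominal column into one Boolean column per distinct value."""
    categories = sorted(set(values))  # assumption: alphabetical column order
    return {category: [v == category for v in values] for category in categories}


expanded = expand_nominal(["High", "Low", "Medium", "Low"])
for name, column in expanded.items():
    print(name, column)
```

Each record ends up with exactly one True across the expanded columns, which is the form a learner can consume without implying an order between the names.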
    • Expression – The evaluation criteria for a feature.
    • Feature – Any column of a dataset which can be used for analysis.
    • Feature Selection – Selecting a subset of relevant features to be used when creating the model.
    • Featuresatatime – The sequence of the feature when analyzed in combination.
    • Filters – A set of logical criteria applied to features of a dataset in order to limit the dataset to a particular subset of records.
    • Flat File – A file containing records that have no structured interrelationship (rows and columns).
    • Grid – An AWS component which runs Neuron jobs.
    • Goal – A feature which is the dependent variable of a dataset or the topic of an investigation.
    • Groups – (See Filters)
    • Identifier – The primary key (unique row identifier) within a dataset that is often logically derived from the data being analyzed.
    • ImportantFeatureCount – Indicates the number of causal factors you want Neuron to return with each scored result. This is a non-negative whole number that should be reasonable given the number of features available in the dataset.
    • Independent Variable – A variable (feature) used to predict the outcome.
    • Informationgain – Information gain represents how much extra information one receives by knowing a given fact or set of facts.
    • Investigation – A workspace created in Neuron Self-Service that may have the results of an investigation or just the setup information.  An investigation typically consists of a topic, a data source (dataset), and the resulting Signals and Clusters.
    • ImportantFeatures – A list of features and weights that were used in the scoring process. May contain between 0 and 5 important features. Neuron will return the most important features for the score up to the requested amount. Only features that had an influence on the score are returned.
    • K-Nearest Neighbors – The non-parametric pattern-recognition algorithm used by Neuron during Scoring to calculate the distance between the scoring record and training examples.
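A minimal sketch of the nearest-neighbor idea follows. It assumes records are already normalized numeric tuples and uses Euclidean distance; the glossary does not say which distance measure Neuron uses, and `nearest_neighbors` is a hypothetical helper.

```python
import math


def nearest_neighbors(record, training_examples, k=3):
    """Return the k training examples closest to record, nearest first."""
    ranked = sorted(training_examples, key=lambda example: math.dist(record, example))
    return ranked[:k]


training = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (0.9, 0.8)]
neighbors = nearest_neighbors((0.1, 0.1), training, k=2)
print(neighbors)
```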
    • Hierarchy – How many clusters Neuron will generate. Users can select to have only primary clusters generated [ x ] or can have both primary and secondary clusters generated [ x , y ].
    • LastModified – The date and time the job configuration parameters were modified.
    • learningTechnique – Indicates the specific learning techniques to be included in the array of learners. Available learners include: LINEAR_REGRESSION, LOGISTIC_REGRESSION (recommended only for Boolean goals), DECISION_TREE, RANDOM_FOREST, NEURAL_NET and GRADIENT_BOOST.
    • LearningComplexity – The depth of learning Neuron performs. The range is from 1 to 6. Level 1 is the simplest and has the fastest learning time. Learning time grows exponentially as the value increases, in exchange for deeper learning on the dataset.
    • Learners – A collection of learners, or modeling techniques, that can be used when building a prediction model. More than one can be included, but at least one must be defined.
    • Lever – A flag which indicates whether (True) or not (False) this feature can be used as a dynamic value in simulations or prescriptive scoring.
    • Location Key(s) – Feature(s) that are designated through the data configuration service to identify locality.
    • Machine Learning – A subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence and explores the construction and study of algorithms that can learn from and make predictions on data.
    • Maximize (Objective) – Indicates whether Neuron should search for profiles that achieve the highest (true) or lowest (false) outcome.
    • Metafile/Meta – A file which describes the dataset.
    • Modeling (data) – The process of transforming data into a Neuron-ready dataset.
    • MI (Mutual Information)/MI Score – A measure of the degree of dependence between a particular signal and the topic. The higher the MI, the stronger the dependence. Generally, MI is the preferred means of quantifying the “strength” of the signal of a given feature. MI is not directional (i.e., it does not imply a positive or negative correlation between feature and objective), and it only shows that a high amount of information is shared between the two.
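The textbook mutual information between two discrete columns can be computed as sketched below. This is the standard formula in bits, offered as an illustration only; how Neuron estimates or scales its MI Score is not described here, and `mutual_information` is a hypothetical helper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), joint in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        total += (joint / n) * math.log2(joint * n / (count_x[x] * count_y[y]))
    return total


feature = [0, 0, 1, 1]
same_direction = mutual_information(feature, ["a", "a", "b", "b"])
opposite_direction = mutual_information(feature, ["b", "b", "a", "a"])
unrelated = mutual_information(feature, ["a", "b", "a", "b"])
print(same_direction, opposite_direction, unrelated)
```

The first two calls return the same value even though the relationship is reversed, which illustrates why MI is not directional; the third returns zero because the feature tells us nothing about the topic.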
    • Nominal – A piece of data (usually a String) that is only differentiated in value by its name.
    • Nominal Ordering – Allows for rank order (1st, 2nd, 3rd, etc.) by which data can be sorted, but does not allow for relative degree of difference between them.
    • Nominal Sampling – The process of restricting the number of nominal values based on the number of occurrences of the value within the set.
    • Normalization – The internal process by which Neuron assigns values to elements in an input dataset.  Normalized datasets are used for Signals, Clusters, and Decision Trees.  Data types are converted in the following manner:
      • Numeric values in a column are converted to values between 0 and 1 based on the minimum and maximum values and the relative distance of all other items in that set
      • Boolean values are converted to 0 and 1
      • Strings are replaced with numeric representations relative to the number of items in the set.  For instance, {High, Low, Medium} would be replaced with {0, 1, 2}.  If ORDER is specified in a list of values (e.g. Low, Medium, High), the sample dataset would be replaced with {2, 0, 1}. If no order is specified, the string will be replaced by the order in which it appears in the set. 
    • Name – The unique identifier for the name of the dataset. Note: Spaces are not accepted. Once created, the name cannot be changed.
    • NULL – A representation of an unknown value or the absence of data.  For Neuron, NULL values (values listed as empty string or "{data,data,,data }") are normalized in the following ways:
        • Ordinals (numbers): NULL values are normalized to 0.5, which translates to the average of the set
        • BOOLEAN: NULL values are normalized to 0 (False)
        • Nominal: NULL values are treated as just another value in the set.
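The normalization and NULL rules above can be sketched as follows. This is an illustration under stated assumptions, not Neuron's code: the three helper functions are hypothetical, `None` stands in for NULL, and mapping a constant numeric column to 0.0 is a guess, since the glossary does not cover that case.

```python
def normalize_numeric(values):
    """Min-max scale numbers to the 0-1 range; NULL (None) becomes 0.5."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = high - low
    # Assumption: a constant column (span == 0) maps every value to 0.0.
    return [0.5 if v is None else ((v - low) / span if span else 0.0) for v in values]


def normalize_boolean(values):
    """Booleans become 1 and 0; NULL (None) becomes 0 (False)."""
    return [1 if v else 0 for v in values]


def encode_strings(values, order=None):
    """Replace strings with their position in order, or in first-seen order."""
    if order is None:
        order = list(dict.fromkeys(values))  # distinct values, first-seen order
    codes = {value: index for index, value in enumerate(order)}
    return [codes[v] for v in values]


print(normalize_numeric([10, 20, None, 30]))
print(normalize_boolean([True, False, None]))
print(encode_strings(["High", "Low", "Medium"]))
print(encode_strings(["High", "Low", "Medium"], order=["Low", "Medium", "High"]))
print(encode_strings(["a", None, "a"]))  # NULL is just another nominal value
```

The two string calls reproduce the glossary example: without an order the set becomes {0, 1, 2}, and with the order Low, Medium, High it becomes {2, 0, 1}.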
    • Numbins – The number of bins values will be grouped into. Only supported for binned distributions.
    • Objective – A feature that is the dependent variable of a dataset or the topic of an investigation (See also: Topic, Goal, Dependent Variable).
    • Optimized – A read-only flag displaying whether or not a particular data set has been optimized.
    • Ordinal – A numeric data type (INTEGER or DOUBLE).
    • Outlier – A piece of data that falls significantly outside of the normal range of values (e.g., if the set of data is {1, 2, 3, 4, 5, 10000}, 10000 would be considered an outlier).  In a normal distribution, a data point that lies more than about 2 standard deviations from the mean of the set is considered an outlier.
    • Over Sampling – A sampling technique used to account for very low occurrence of a specific example of an objective. Generally, if the desired objective occurs in less than 2% of the dataset, oversampling is required.
    • Profiles – Identifies groups of records that overperform or underperform. A profile contains individuals that are identical on a set of attributes and that generally overperform or underperform.
    • PopulationPercentage – Indicates the minimum share of the audience population that a profile must cover for Neuron to return it. If this minimum threshold is not met, Neuron will not return a profile for a given combination of factors. Note: Value must be greater than 0 and less than 1.
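The two-standard-deviation rule of thumb can be sketched as below, using the set from the definition. The `find_outliers` helper is hypothetical, and using the population standard deviation is an assumption; the glossary does not say how Neuron itself flags outliers.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > threshold * spread]


outliers = find_outliers([1, 2, 3, 4, 5, 10000])
print(outliers)
```

Note that the extreme value inflates both the mean and the standard deviation, so this simple check becomes less sensitive as outliers get larger or more numerous.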
    • PopulationCount – The total number of unique records matching this condition and any prior conditions in this profile condition set.
    • R, Pearson’s R – The measure of linear correlation between two variables where 1 is total positive correlation, 0 is no correlation and -1 is total negative correlation.
    • R^2 – A measure of how well a particular model fits or represents the behavior of the data.
    • ResultID – Unique identifier for the result set stored in Neuron for the job.
    • RMSE (Root Mean Square Error) – A measure of the differences between values predicted by a model and the values observed.
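For reference, the standard textbook formulas behind R, R^2, and RMSE can be written out as below. These are generic definitions with hypothetical function names, not Neuron's internal code, and the R^2 shown is the common 1 - SS_res / SS_tot form.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation between two equal-length numeric sequences (-1 to 1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Share of the variance in observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Square root of the mean squared difference between observed and predicted."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly increasing relationship
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfectly decreasing relationship
print(r_squared([1, 2, 3], [2, 2, 2]))   # predicting the mean explains nothing
print(rmse([0, 0], [3, 4]))
```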
    • Sampling – The process of selecting data for a Neuron dataset using randomization or another method.
    • SubClusterDescriptionModel – A nested structure of the cluster result set that encapsulates information for the sub-nodes of a hierarchical cluster.
    • Scoring – A Neuron operation used to rate or score a dataset against an existing Neuron model.
    • Score – The predicted value for the given record.
    • Signal(s) – Represent the strength or weakness of the relationship between a particular input (independent variable) and the Topic (dependent variable). Signals vary based on the topic and the historical data available. Signals are categorized as follows:
        • Very Weak: Very low MI
        • Weak: Low MI
        • Moderate: Medium MI
        • Strong: Somewhat high MI
        • Very Strong: Very high MI
    • SignalSetInfo – A collection of signals, one for each defined in maxAtATime.
    • (Neuron) Self-Service – A web-based platform to perform data analysis.
    • Snapshot – A point-in-time representation of the data in Neuron Self-Service used for an investigation.  A snapshot must have a topic assigned to it for Neuron Self-Service to use it as an investigation data source.
    • Time Horizon – A point in time where your predicted value begins to be explained by your dependent variable.
    • Topic – A feature that is the dependent variable of a dataset or the topic of an investigation (See also: Goal, Objective, Dependent Variable).
    • TotalGoal – The total (or sum) goal value for all members in the cluster that match the specified value. Total goal value is more appropriate for ordinal goals.
    • Training – A Neuron operation used to learn about a dataset and create a statistical model based on the factors contributing to the change in the dependent variable.
    • TreeDepth – Indicates the number of levels a profile search can go before Neuron stops searching. Note: Values must be between 1 and 10
    • Type – Indicates whether the feature/expression combination is to be considered an Inclusion or Exclusion condition. Allowed values are "INCLUDE", "EXCLUDE"
    • TreeId – The identifier of the tree to which a profile belongs.
    • Validation, training – A Neuron internal process used to validate and optimize the model. A random 10% of the training dataset is withheld from training and used to validate the model constructed.
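The 10% holdout described above can be sketched as a simple random split. The function name, the `seed` parameter, and the rounding of the holdout size are assumptions made for the example; Neuron performs this step internally.

```python
import random


def split_training_validation(records, validation_fraction=0.10, seed=None):
    """Randomly withhold a fraction of records for validation.

    Returns (training_records, validation_records).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    holdout_size = round(len(shuffled) * validation_fraction)
    return shuffled[holdout_size:], shuffled[:holdout_size]


training, validation = split_training_validation(range(100), seed=42)
print(len(training), len(validation))
```

The withheld records never influence training, so the error measured on them gives a fairer picture of how the model behaves on unseen data.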
    • Values – For ORDEREDSTRING, the sequence of the Values list defines the order of the values for the feature. For STRING, if the Values list is not empty, any String value not explicitly contained in the list will be rejected by Neuron. If the Values list is empty, all values are accepted by Neuron.
    • Web Service(s) – A list of services used to interact with Neuron outside of the typical self-service model.