Groundwork, Concepts, Pre-requisites
Things to get ready before jumping into AI/ML.
- Learn Python fundamentals, preferably using Python 3. You definitely need a good understanding of data types, especially lists, dicts, and tuples.
- Learn the Pythonic way of coding
- Get familiar with Google Colab or Jupyter notebook
Basic concepts to understand before working with AI, ML
- While classic programming is getting answers by applying rules to data, machine learning as a process is combining data and existing answers to generate 'rules' that can then be applied to new data to derive answers.
- Data Science, Deep Learning, Machine Learning, and AI are (wrongly) used interchangeably. These are separate, albeit overlapping, concepts. Data Science is the umbrella term. Machine Learning is one application of AI. Deep Learning is a category of machine learning. There can be AI outside the Data Science umbrella, but most applications of AI in IT are in the Data Science realm.
- Regression vs. Neural Network models: If you are trying to uncover patterns or rules in a structured dataset, use regression models like GBRT. If you are trying to identify patterns or rules in an image or text input, use Neural Networks.
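The contrast between classic programming and machine learning described above can be sketched with a toy example. The speeds, labels, and the 80 km/h threshold are hypothetical values made up for illustration, and the "learning" step is a deliberately simple threshold search, not a real ML algorithm.

```python
# Classic programming: a human writes the rule and applies it to data.
def is_speeding_rule(speed_kmh):
    return speed_kmh > 80.0  # hand-picked threshold (illustrative)


# Machine learning (toy version): derive the rule from data plus known answers.
def learn_threshold(speeds, labels):
    """Pick the candidate threshold that misclassifies the fewest examples."""
    candidates = sorted(speeds)
    best_threshold, best_errors = candidates[0], len(speeds) + 1
    for t in candidates:
        # Predict "speeding" when speed >= t and count disagreements with labels.
        errors = sum((s >= t) != y for s, y in zip(speeds, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


# Hypothetical training data: speeds and the existing "answers".
speeds = [40.0, 55.0, 62.0, 70.0, 85.0, 90.0, 110.0]
labels = [False, False, False, False, True, True, True]

threshold = learn_threshold(speeds, labels)


def is_speeding_learned(speed_kmh):
    """Apply the learned rule to new data."""
    return speed_kmh >= threshold
```

In the first function the rule is an input; in the second it is an output of the data.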
Common terminology in ML (or How to sound like an AI ML expert)
- Features : For a tabular dataset this is just the columns. Basically features are the characteristics being fed to the ML model to help it predict something.
- Target : The property being predicted.
- Machine : This is the model being used to generate predictions
- Features with high correlation to target: The columns in the dataset that are most likely to affect the property being predicted. For example speed of vehicle might affect chances of it being in an accident.
- Training a model/learning a model: Fitting the model to a dataset so that it learns the patterns that map features to the target.
- GBRT: Gradient Boosted Regression Tree model - a model like XGBoost typically applied to structured datasets
- CNN : Convolutional Neural Network. Used for structured grid data - like images, videos.
- Classification vs. Regression problems: When your target prediction is a discrete value (specific color, one of five options) then it's a classification problem. If you are trying to predict a continuous numeric value, it's a regression problem.
- Flattening an image : Turning an image into a vector (one dimensional array) so stupid computers can digest the dense data of colors, shades, boundaries, objects that we absorb so easily when we see with our eyes.
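As a small illustration of flattening, here is a sketch using NumPy. The 4x4 array is a made-up stand-in for a real image, which would normally be loaded from a file.

```python
import numpy as np

# Hypothetical 4x4 grayscale "image"; each entry is one pixel intensity.
image = np.arange(16, dtype=np.uint8).reshape(4, 4)

flat = image.flatten()          # one-dimensional copy with 16 values
restored = flat.reshape(4, 4)   # the original layout can be recovered

# A colour image has a third axis for channels: (height, width, 3).
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb_flat = rgb.reshape(-1)      # 4 * 4 * 3 = 48 values
```

Each element of the flattened vector then acts as one feature fed to the model.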
Top priority areas in ML
XGBoost: tabular data problems.
Convolutional models: image problems.
Transformer models: text problems.
Hyperparameters
- Keras
- Keras Callback
Math
Dev Env
Models, practice
Popular Gradient Boost model: XGBoost
Popular CNN models:
Thumb-rules, good practices, tips, recommendations
These are not rules, just suggestions and observations from people with experience.
- In a neural network, how many trainable parameters should you allow? As a rule of thumb, the number of available data points should be about 10x the number of parameters. Training too many parameters on too little data can lead to overfitting.
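To make the 10x rule of thumb concrete, the sketch below counts the trainable parameters of a fully connected network and compares the count with the dataset size. The layer sizes and sample counts are hypothetical, and the check is only the heuristic quoted above, not a guarantee against overfitting.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per output unit
    return total


def enough_data(n_samples, layer_sizes, ratio=10):
    """Apply the 10x rule of thumb: samples >= ratio * parameters."""
    return n_samples >= ratio * dense_param_count(layer_sizes)


# Hypothetical network: 20 input features, hidden layers of 16 and 8, 1 output.
params = dense_param_count([20, 16, 8, 1])
```

For this network the rule suggests a dataset of several thousand records before training is likely to generalize.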
Awesome references
The AI-MLOps course by IISc (coordinated by Talentsprint) is a deep dive into applied AI, ML. Excellent for practitioners.
The Neural Network Playground lets you play with four datasets. Try out different learning rates, activation functions, classifications and regularization. Useful to visually understand the effect of each parameter on the neural network.
The book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
Machine Learning - the process
The high level process is:
- Data collection and processing
- Feature engineering
- Data splitting
- Model selection
- Model training
- Model evaluation
- Final model selection
- Deployment
- Documentation and reporting
- Iterative improvement
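Several of the steps above (splitting, model selection, training, evaluation, final model selection) can be sketched in a few lines with scikit-learn. This is a minimal illustration on synthetic data: the dataset, the two candidate models, and the 60/20/20 split are assumptions chosen for the example, not a recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data collection: synthetic data stands in for a real dataset here.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Data splitting: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Model selection and training: fit each candidate on the training split.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Final model selection: keep the candidate with the best validation score.
best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# Model evaluation: report once on the held-out test split.
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(best_name, round(test_accuracy, 3))
```

Keeping the test split untouched until the end is what makes the final number an honest estimate.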
XGBoost model training and evaluation Example
After you finish loading the dataframe, understanding the 'shape' of the data and checking out the records, remove the outliers.
Then install and import xgboost:
pip install xgboost
import xgboost as xgb
Next, initialize and train the model using XGBClassifier:
xgb_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
)
xgb_classifier.fit(X_train, y_train)
Math foundation for AI ML development
Linear algebra and calculus help with the data science part of AI MLOps implementations. An understanding of probability and statistics is useful for model implementation. For example, if you want to predict the probability of rain or the likelihood of reaching a place through dense traffic, you need to understand probability and stats. The same applies if you want to predict the chances of a batsman scoring a century or being bowled out.
Basic math concepts and terms worth understanding before jumping into AI:
- Mean (average), median, mode, quartiles.
- Data classification into deterministic variable, random variable, discrete values, discrete random variables and continuous random variables.
- Histograms, probability mass functions
- Variance, standard deviation
- Covariance and correlation
- Unions and intersections
- Sample space
- Probability Density Function (PDF), Probability Mass Function (PMF)
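Several of the terms in this list can be computed directly with Python's standard `statistics` module. The two samples below (vehicle speeds and braking distances) are invented for illustration, and the covariance and correlation helpers are hand-written sketches using the population formulas.

```python
import statistics

# Hypothetical samples: vehicle speed (km/h) and braking distance (m).
speeds = [40, 50, 50, 60, 70, 80, 90]
distances = [18, 24, 25, 33, 42, 55, 68]

mean_speed = statistics.mean(speeds)
median_speed = statistics.median(speeds)
mode_speed = statistics.mode(speeds)
quartiles = statistics.quantiles(speeds, n=4)  # three cut points: Q1, Q2, Q3
variance = statistics.pvariance(speeds)         # population variance
std_dev = statistics.pstdev(speeds)             # population standard deviation


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


r = correlation(speeds, distances)
```

A correlation close to 1 here says the two made-up columns rise together, which is the sense in which a feature is "highly correlated" with a target.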
Feature engineering. Example: converting a 1-5 rating scale to Negative, Neutral, Positive ratings. Helps make the distribution of the data more uniform.
One hot encoding.
Label encoding. Converting categories into numeric representations.
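The rating example, label encoding, and one-hot encoding can be sketched in plain Python as follows. The cut points for the three buckets (1-2, 3, 4-5) are an assumption, and the helper names are hypothetical; libraries such as scikit-learn or pandas provide their own encoders.

```python
def bin_rating(rating):
    """Collapse a 1-5 rating into three buckets (cut points are an assumption)."""
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {category: i for i, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector containing a single 1."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


ratings = [5, 1, 3, 4, 2]
sentiments = [bin_rating(r) for r in ratings]
labels, mapping = label_encode(sentiments)
one_hot, categories = one_hot_encode(sentiments)
```

Label encoding implies an order between categories, while one-hot encoding does not, which is usually the deciding factor between the two.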
Statistical estimation.
Normal (Gaussian) distribution.
A normalized histogram of a variable (scaled so that its total area is 1) approximates the PDF of that variable.
How many records should be selected for sampling? Enough that the distribution does not change much when another data point is added.
Mean absolute deviation (M.A.D.)
Mean squared deviation (M.S.D., or variance)
Standard normal table
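As a quick sketch of the last three items, the snippet below computes the mean absolute deviation and mean squared deviation by hand, then uses `statistics.NormalDist` in place of a printed standard normal table. The data values are made up for the example.

```python
from statistics import NormalDist, mean

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
mu = mean(data)

# Mean absolute deviation: average distance from the mean.
mad = mean([abs(x - mu) for x in data])

# Mean squared deviation (population variance) and its square root.
msd = mean([(x - mu) ** 2 for x in data])
std = msd ** 0.5

# z-score of the largest value, then the table lookup P(Z <= z).
z = (max(data) - mu) / std
p_below = NormalDist().cdf(z)
```

The lookup only means something if the data is roughly normally distributed, which this tiny sample does not establish.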
ML Models deep-dive, practice
Good reference:
- Best for most: Linear/logistic regression, Decision Tree, Neural Networks, XGBoost, Naive Bayes, PCA, KNN, SVM, t-SNE.
- ML models for tabular datasets: a paper published by Intel suggests XGBoost is best for tabular data as of 2021.
- XGBoost alternatives: CatBoost, HistGradientBoosting, LightGBM.
Decision Tree
Data-driven models for classification and regression. An unconstrained decision tree can reach zero error on the training dataset, which makes it a powerful model for fitting data but also prone to overfitting. Trained by a greedy optimization algorithm called the Classification and Regression Tree (CART) algorithm. Decision trees work like recursive if-else conditions, eliminating branches based on the criteria separating one decision from another. For example, to decide the species of a flower, the data points considered may be petal length, petal width, and color. A decision tree will check the petal length condition, go down the branch that matches, then check petal width and go down the sub-branch. At some point the decision branches are exhausted and a final prediction is available.
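The if-else picture can be written out literally. The function below is a hand-written stand-in for a trained tree, not the output of CART: the species names and thresholds are illustrative assumptions, loosely modeled on the well-known iris dataset.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Walk a tiny hand-written decision tree and return the leaf's class."""
    # Root node: split on petal length.
    if petal_length_cm < 2.5:
        return "setosa"          # leaf node
    # Second level: split on petal width.
    if petal_width_cm < 1.8:
        return "versicolor"      # leaf node
    return "virginica"           # leaf node
```

A training algorithm such as CART chooses which feature to test at each node and where to place the threshold; here both were picked by hand.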
Terms used in DT
- Gini, class, sample attribute, value attribute. Gini is the impurity at a node.
- Decision boundaries, leaf node, feature space.
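Gini impurity is simple enough to compute by hand. The sketch below uses the standard definition (1 minus the sum of squared class proportions) and a weighted average to score a candidate split; the class labels are placeholders.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left_labels, right_labels):
    """Weighted average impurity of the two child nodes after a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n


node = ["a", "a", "b", "b"]            # evenly mixed node
mixed = gini_impurity(node)
perfect_split = split_impurity(["a", "a"], ["b", "b"])
```

A pure node scores 0, and a greedy tree builder prefers the split with the lowest weighted impurity.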
Confusion matrix. Precision. Recall. F1 score. F2 score.
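These metrics all derive from the four cells of a binary confusion matrix. The sketch below computes them from scratch using the standard F-beta formula; the label vectors are made up, and 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(precision, recall, beta):
    """F-beta score: beta=1 gives F1, beta=2 weights recall more heavily (F2)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
```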
Root mean square error. Mean absolute error. Relative error. R2 = 1 - MSE / Variance (of the target).
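The regression metrics can be checked by hand on a tiny example. The values below are invented, and the variance used for R2 is the population variance of the true targets.

```python
import math

# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n

# Variance of the targets around their own mean.
mean_true = sum(y_true) / n
variance = sum((t - mean_true) ** 2 for t in y_true) / n

r2 = 1 - mse / variance
```

An R2 near 1 means the model's squared error is small compared with simply predicting the mean every time.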
K-fold cross validation. Hyperparameter tuning.
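The mechanics of K-fold cross validation come down to how the record indices are partitioned. The generator below is a minimal sketch with a hypothetical name; it does not shuffle, which real implementations (for example scikit-learn's `KFold`) can do for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each record is used for validation exactly once, and the k validation scores are typically averaged when comparing hyperparameter settings.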
Logistic regression is the simplest neural network we can build; it uses the sigmoid activation function.
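Seen as a network, logistic regression is a single neuron: a weighted sum of the features followed by a sigmoid. The sketch below shows only the forward pass; the weights and bias are hypothetical numbers, not trained values.

```python
import math


def sigmoid(z):
    """Squash any real number into (0, 1); written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """One neuron: weighted sum of inputs plus bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights for a two-feature model.
weights = [0.5, -0.25]
bias = 0.0
probability = predict_proba([1.0, 2.0], weights, bias)
predicted_class = 1 if probability >= 0.5 else 0
```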
What is a loss function?
Gradient descent algorithm. Stochastic gradient descent algorithm.
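A loss function scores how wrong the model is, and gradient descent repeatedly nudges the parameters in the direction that lowers that score. The sketch below fits a one-parameter model y = w * x with mean squared error as the loss; the data, learning rate, and step counts are assumptions picked so the toy problem converges.

```python
import random


def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the MSE loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Batch gradient descent: every step uses the whole dataset."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


def stochastic_gradient_descent(xs, ys, learning_rate=0.05, steps=200, seed=0):
    """SGD: every step uses one randomly chosen record."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        w -= learning_rate * 2 * (w * xs[i] - ys[i]) * xs[i]
    return w


# Hypothetical data generated by y = 2 * x, so the best w is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_batch = gradient_descent(xs, ys)
w_sgd = stochastic_gradient_descent(xs, ys)
```

With a learning rate that is too large the updates overshoot and the loss grows instead of shrinking, which is why it is treated as a hyperparameter to tune.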
Activation functions - ReLU for hidden dense layers. Sigmoid/softmax on the output layer for classification. No activation on the output layer for regression.
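The activation functions named here are short enough to write out. This is a plain-Python sketch for scalars and small lists; frameworks such as Keras ship their own vectorized versions.

```python
import math


def relu(z):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Map a score to a probability for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Map a list of scores to probabilities that sum to 1 (multi-class)."""
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def identity(z):
    """'No activation': the raw output, as used for regression."""
    return z


probs = softmax([1.0, 2.0, 3.0])
```

Softmax over two scores `[z, 0]` gives the same first probability as `sigmoid(z)`, which is why the two are grouped together for classification.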
Good rules of thumb : use ReLU, five layers, all neurons available (validate this)