Which ML algorithm to use when?

This question usually comes up when starting a new project, or when we want to compare algorithms and tell them apart (contrasting them sometimes helps in understanding them better).

Although there is no definitive answer, here I summarize a few posts on the topic. [0]

 

[Image: machine learning cheat sheet]

 

Factors to consider

  • Accuracy
    • Most people focus on accuracy, but in practice it is not the only factor
  • Training time
    • Naive Bayes and logistic regression train much faster than boosting or neural nets
  • Linearity
    • Logistic regression (LR) and SVM are suitable when classes are linearly separable
      • Of course SVM can get around this with the kernel trick, but its decision boundary is still not as complex as that of neural nets
    • Despite the risk of non-linearity in the data, linear algorithms tend to work well in practice and are often used as a starting point
  • Number of parameters
    • The number of parameters affects both training time and accuracy
    • More parameters help in learning a complex function, but more data is then needed to prevent over-fitting
  • Number of features
    • When there are too few data points for the number of features (text, NLP), SVM works well
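
To make the kernel trick mentioned under Linearity a bit more concrete, here is a minimal sketch in plain Python. The feature map `phi`, the degree-2 polynomial kernel and the four XOR-style points are illustrative choices of mine, not something taken from the posts summarized here. It checks two things: the kernel value equals the inner product in the mapped space, and a linear rule separates the mapped points even though no straight line separates the original ones.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

# XOR-style data: no straight line in the original 2-D space separates the classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# The kernel gives the feature-space inner product without ever building phi(x).
kernel_matches = all(
    math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)), abs_tol=1e-9)
    for x in points
    for z in points
)

# In feature space the middle coordinate alone separates the classes,
# i.e. the linear rule sign(w . phi(x)) with w = (0, 1, 0) fits every point.
w = (0.0, 1.0, 0.0)
predictions = [1 if dot(w, phi(x)) > 0 else -1 for x in points]

print(kernel_matches, predictions == labels)
```

A real SVM never builds `phi(x)` explicitly; it only evaluates the kernel on pairs of points, which is what keeps the higher-dimensional mapping affordable.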

 

Notes

  • Try linear/logistic regression and SVM first when most of your features are numeric.
  • SVM
    • SVM is a better fit when there are few data points for the given number of features.
    • SVM is fundamentally a linear classifier. The kernel trick implicitly maps linearly inseparable data into a higher-dimensional space, where a linear separator may exist.
    • Training an SVM is a convex optimization problem, unlike training neural nets, so it tends to be a bit faster.
  • What is the difference between LR and SVM?
    • LR has a linear decision boundary, while SVM with a non-linear kernel can have a non-linear decision boundary.
  • Reinforcement learning
    • Analyses and optimizes the behavior of an agent via feedback from the environment
    • The agent tries out different actions to discover which ones maximize reward
    • Trial and error and delayed reward distinguish reinforcement learning from other ML algorithms
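
The trial-and-error and delayed-reward points can be sketched with tabular Q-learning on a tiny corridor. Everything here is an assumption for illustration: the five-state corridor, the `step` function, the hyperparameters and the fixed seed are hypothetical and do not come from the referenced posts. The agent starts in state 0 and is rewarded only when it reaches state 4, so the reward for the early moves arrives late.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical toy corridor)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative hyperparameters

def step(state, action):
    """Environment feedback: next state, reward, and whether the episode ended."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, rng):
    """Epsilon-greedy: mostly exploit, sometimes explore; ties are broken at random."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: the delayed reward is propagated backwards
            # through the bootstrapped estimate of the next state's value.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

With these settings the greedy policy should come out as "move right" (action 1) in every non-goal state, even though only the last move is ever rewarded directly.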

 

References

[0] : https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto

[1] : https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

 

On Clustering

K-means is probably the most popular clustering algorithm and the one most often taught in academia. However, it has many limitations; some of them are listed here:

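  • You need to specify the value of k
  • It will "cluster" data that has no cluster structure
  • Sensitive to the scale of the features
  • Even on perfect data sets, it can get stuck in a local minimum
  • Means are continuous, so a centroid may not be a meaningful point (for example with binary or categorical data)
  • Hidden assumption: SSE (sum of squared errors) is worth minimizing
  • K-means serves more as a quantization technique than as a way to discover clusters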
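
The local-minimum point is easy to reproduce by hand. Below is a minimal sketch of Lloyd's algorithm in plain Python; the four points and both sets of starting centroids are made up for illustration. The data forms two obvious clusters (left and right), yet one initialization converges to a top/bottom split with a much worse SSE.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns the final centroids and the SSE."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse

# Two tight pairs, far apart horizontally: the natural split is left vs right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good_centroids, good_sse = kmeans(points, [(0, 0), (4, 0)])
bad_centroids, bad_sse = kmeans(points, [(2, 0), (2, 1)])

print(good_sse, bad_sse)
```

The second start is already a fixed point of the algorithm, so it stops with an SSE of 16 while the left/right solution has an SSE of 1. This is why practical implementations usually restart from several initializations and keep the best result.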

 

In hierarchical clustering you don't need to specify the value of k. The algorithm builds a tree, by either a top-down or a bottom-up approach, and you can cut that tree at any level. Such a tree is called a dendrogram.

scikit-learn also supports a variety of clustering algorithms, including DBSCAN, and lists which one suits which situation: http://scikit-learn.org/stable/modules/clustering.html
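
As a rough illustration of cutting the tree at any level, here is a small bottom-up (single-linkage) sketch on 1-D data in plain Python. The function name `single_linkage_levels` and the sample values are hypothetical, and storing every level in a dict is a simplification of what a dendrogram encodes. In practice you would more likely reach for a library routine such as `scipy.cluster.hierarchy.linkage`.

```python
def single_linkage_levels(values):
    """Bottom-up (agglomerative) clustering of 1-D values.

    Returns a dict mapping a cluster count k to the partition at that level,
    which is the information a dendrogram encodes.
    """
    clusters = [[v] for v in sorted(values)]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        # With sorted 1-D data, the closest pair under single linkage is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest clusters and record the new level.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels

levels = single_linkage_levels([1.0, 1.2, 5.0, 5.1, 9.0])

# The tree is built once; any k can then be read off without re-running anything.
print(levels[3])
print(levels[2])
```

Unlike k-means, nothing has to be re-fitted when you change your mind about the number of clusters; you only pick a different level.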

 

References:

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means

 

OpenAI Gym Environment

OpenAI Gym provides a framework for creating environments and training agents on them. In this post I am pasting a simple notebook as a quick reference on how to use these environments and which functions are available on the environment object.

I have used an environment available on GitHub from Denny Britz.

References:

https://github.com/dennybritz/reinforcement-learning

Learning Reinforcement Learning (with Code, Exercises and Solutions)

https://gym.openai.com/docs/

My Code : https://gist.github.com/arcarchit/2b3363e2615df7ef5c8d4941d4dfa9e8
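
For a quick feel of the interface without installing anything, here is a sketch of the usual agent-environment loop. `GuessNumberEnv` is a hypothetical stand-in written for this post, not a Gym class and not the environment from the notebook. It only imitates the classic Gym method shapes, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. Gym 0.26+ and Gymnasium instead return `(observation, info)` from `reset()` and five values from `step()`, so adjust accordingly.

```python
import random

class GuessNumberEnv:
    """Hypothetical toy environment that mimics the classic Gym method names.

    It is NOT part of Gym; it only imitates reset() and step(action).
    """

    def __init__(self, low=0, high=9, max_steps=10, seed=None):
        self.low, self.high, self.max_steps = low, high, max_steps
        self._rng = random.Random(seed)
        self._target = None
        self._steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self._target = self._rng.randint(self.low, self.high)
        self._steps = 0
        return 0  # no hint yet

    def step(self, action):
        """Apply a guess; return (observation, reward, done, info)."""
        self._steps += 1
        # Observation is a hint: -1 guess too low, +1 too high, 0 correct.
        observation = (action > self._target) - (action < self._target)
        reward = 1.0 if observation == 0 else 0.0
        done = observation == 0 or self._steps >= self.max_steps
        return observation, reward, done, {"steps": self._steps}

# A typical agent-environment loop, here with a binary-search policy.
env = GuessNumberEnv(seed=42)
observation = env.reset()
low, high = env.low, env.high
total_reward, done = 0.0, False
while not done:
    action = (low + high) // 2
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if observation < 0:
        low = action + 1
    elif observation > 0:
        high = action - 1

print(total_reward, info["steps"])
```

With a real Gym environment the loop has the same shape; only the environment construction and the policy that picks `action` change.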


