I had zero knowledge of machine learning but wanted to explore it. I took up Large Scale Hierarchical Text Classification (LSHTC) as my Master's dissertation project so that I would have a good scenario for getting started with machine learning.
The first thing I wanted to know was the format of the data provided by LSHTC. It turned out to be SVM format. The training data and test data had the following format:
label,label,label… feature:value feature:value
Each label indicates a category the document belongs to.
Each feature:value pair represents a word and its weight (term frequency, TF) in the document.
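To make the format concrete, here is a minimal sketch of a parser for one such line. It assumes that labels and feature ids are integers, that values are numeric, and that labels are separated by commas with no spaces. The function name parse_line and the sample line are made up for illustration; they are not part of the LSHTC tooling.

```python
def parse_line(line):
    """Split one 'label,label feature:value ...' line into (labels, features)."""
    # The first whitespace-separated token holds the comma-separated labels;
    # every remaining token is one feature:value pair.
    label_part, *feature_parts = line.split()
    labels = [int(label) for label in label_part.split(",")]

    features = {}
    for pair in feature_parts:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return labels, features


# Hypothetical sample line: two labels, three words with their term frequencies.
labels, features = parse_line("33,45 10:2 27:1 103:4")
```

Keeping the features in a dict mirrors the sparsity of the data: only the words that actually occur in a document are stored.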
Choice of programming language
I had to make a choice between Java and Python.
I chose Python for the following reasons:
- Huge set of machine learning libraries: given that I was a beginner, this made a big difference. More libraries, more documentation, more examples => more experiments and better understanding
- Most machine learning these days is done with Python
- Less cumbersome to try out a scenario: given that Python is more of a scripting language, experiments can be run quickly, especially with IPython
- Also the hype around it these days :)
The machine learning libraries I looked at:
- scikit-learn: massive collection of algorithms for regression, classification, clustering, dimensionality reduction, model selection, pipelining etc.
- mlpy: similar to scikit-learn but offers a smaller set
- graphlab: more of a recommendation engine
- Spark: very good parallel ML framework but still in its early stages. Does not offer many algorithms
I started off with scikit-learn. It offers a huge range of libraries and algorithms. I then had to do a lot of reading about the basics of classification: hyperplanes, linear and non-linear classification, K-Nearest Neighbours and Support Vector Machines (SVM), including what an SVM is and why it is used.
The Stanford NLP book helped me a lot in understanding the basics of classification.
I’m an absolute beginner at machine learning, and every algorithm I look at seems to be the right one. Only after experimenting with each of them do you know which is the best fit and why.
The problem I was solving involved medium-scale data, with 250,000 records of test data and 2 million records of training data. Both the training and the test data have a large number of features.
K Nearest Neighbour
I started the first trial with the K-nearest neighbour algorithm. It turns out this is a very good algorithm, but it doesn’t scale well to larger datasets. There are a number of flavours of kNN that speed up the neighbour search with index structures such as KD Tree and Ball Tree, but they still don’t help much on datasets with more than 10,000 records.
I also frequently got a “Core dumped” error when I tried plain kNN and kNN with chi2-based best-feature selection. I am still figuring out the reason; my feeling is that it doesn’t scale to larger datasets. But I get the same error for smaller datasets of 100 records, which is weird and hints that I might be doing something wrong! After reading a few articles, I came to the conclusion that it is better to use SVM for large datasets.
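For reference, kNN with chi2 feature selection can be sketched in scikit-learn roughly as follows. This is a toy example and not the exact code from my experiments: the tiny matrix, k=2 and n_neighbors=1 are made-up values, and I am assuming brute-force search because the tree-based indexes do not accept sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy term-frequency matrix: rows are documents, columns are words.
X_train = csr_matrix(np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 3, 1, 0],
    [0, 2, 0, 1],
], dtype=float))
y_train = np.array([0, 0, 1, 1])

knn = make_pipeline(
    # Keep the 2 words whose counts depend most strongly on the label.
    SelectKBest(chi2, k=2),
    # Brute-force neighbour search, which works on sparse matrices.
    KNeighborsClassifier(n_neighbors=1, algorithm="brute"),
)
knn.fit(X_train, y_train)

X_test = csr_matrix(np.array([[4, 0, 0, 0], [0, 4, 0, 0]], dtype=float))
predictions = knn.predict(X_test)
```

Note that kNN has to compare every test document against the stored training set at prediction time, which is one reason it gets expensive as the data grows.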
Support Vector Machines (SVM)
Support Vector Machines are fast and efficient learning algorithms for classification and regression, and they work well on medium-sized datasets. A linear SVM does linear classification. We can also define custom kernels for an SVM. The SVM library in scikit-learn offers commonly used kernels like
- Radial Basis Function (rbf)
The results with the RBF kernel turned out to be bad: I got the same label prediction for most of the test data.
I switched to linear SVM and the results turned out to be quite decent.
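Switching between the two in scikit-learn is a small change. The sketch below uses a made-up toy dataset and default parameters, so it only shows the mechanics and does not reproduce the results described above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import SVC, LinearSVC

# Toy data: class 0 documents use word 0, class 1 documents use word 1.
X_train = csr_matrix(np.array([[1, 0], [2, 0], [0, 1], [0, 2]], dtype=float))
y_train = np.array([0, 0, 1, 1])
X_test = csr_matrix(np.array([[3, 0], [0, 3]], dtype=float))

# SVM with a Radial Basis Function kernel.
rbf_model = SVC(kernel="rbf").fit(X_train, y_train)
rbf_predictions = rbf_model.predict(X_test)

# Linear SVM; LinearSVC is a separate estimator aimed at larger, sparse data.
linear_model = LinearSVC().fit(X_train, y_train)
linear_predictions = linear_model.predict(X_test)
```

Both estimators accept sparse matrices, which matters here because text feature vectors are mostly zeros.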
Given the problem is about Large Scale classification, scaling the algorithm to cater to large datasets is very important!
As of today, most algorithms in the scikit-learn library run on a single core. This becomes a bottleneck when running prediction on large datasets.
The way out is multicore processing: we can divide the task into subtasks and run them on different cores. In my case, I split the test data into smaller subsets and predicted them as separate jobs, utilizing multiple cores. scikit-learn also comes with a job-processing library called joblib, which enables this process.
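The splitting described above can be sketched with joblib like this. The helper names predict_chunk and predict_in_parallel, the toy data, and the choice of 2 jobs and 4 chunks are assumptions for illustration, not the code I actually ran.

```python
import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC


def predict_chunk(model, X_chunk):
    """Predict labels for one slice of the test data."""
    return model.predict(X_chunk)


def predict_in_parallel(model, X, n_jobs=2, n_chunks=4):
    """Split X into row chunks, predict each chunk as its own job, and rejoin."""
    bounds = np.linspace(0, X.shape[0], n_chunks + 1, dtype=int)
    chunks = [X[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])
              if stop > start]
    # Parallel returns the results in the same order as the submitted jobs.
    parts = Parallel(n_jobs=n_jobs)(
        delayed(predict_chunk)(model, chunk) for chunk in chunks
    )
    return np.concatenate(parts)


# Toy data standing in for the real training and test sets.
X_train = csr_matrix(np.array([[1, 0], [2, 0], [0, 1], [0, 2]], dtype=float))
y_train = np.array([0, 0, 1, 1])
X_test = csr_matrix(np.array([[3, 0], [0, 3]] * 4, dtype=float))

model = LinearSVC().fit(X_train, y_train)
parallel_predictions = predict_in_parallel(model, X_test, n_jobs=2, n_chunks=4)
```

Each job receives its own copy of the model and of its chunk, so this only pays off when the prediction work outweighs the cost of shipping the data to the workers.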
Soon we run into the problem of having a copy of the training data in every job doing the prediction. To overcome this, joblib provides caching of function results, and cached arrays can be loaded as memory maps. This helps us avoid creating copies and instead share the memory across all jobs. The problem seems to be solved, but this will not work when the dataset is large enough that it needs to be processed on different machines!
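A minimal sketch of that caching, assuming joblib's Memory class with its mmap_mode option. The function load_training_matrix is a hypothetical stand-in for whatever loads the real training data, and the cache lives in a temporary directory just for this example.

```python
import tempfile

import numpy as np
from joblib import Memory

# Cache directory on disk; cached arrays are loaded back as read-only memory maps.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, mmap_mode="r", verbose=0)

load_calls = []  # records how often the function body actually runs


@memory.cache
def load_training_matrix(n_rows, n_cols):
    """Hypothetical stand-in for an expensive load of the training data."""
    load_calls.append((n_rows, n_cols))
    return np.arange(n_rows * n_cols, dtype=float).reshape(n_rows, n_cols)


first = load_training_matrix(4, 3)   # runs the function and stores the result
second = load_training_matrix(4, 3)  # same arguments: served from the cache
```

Because the cached array is backed by a file on disk, processes on the same machine can map the same file instead of each holding a private copy. That file is local, which is why this stops helping once the work is spread across machines.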