Machine learning requires careful thought and planning. Start by analyzing your ML workflow—what you want your project to do, and how you will reach your destination.
Because machine learning (ML) is hot right now, you can easily find a lot of information about it online. However, ML is an art and a science, and the best way to learn it is by taking on a small, low-risk project. Don’t damage your reputation by learning just enough about ML online, only to cause problems for your client or employer down the road.
To get started, have your destination in mind and map out how you will get there. We call these steps the ML workflow, and they require understanding some key terms and basic concepts. Here is a short glossary:
- Features/Inputs: Individual independent variables that act as inputs in an ML system. Features are also called attributes. Feature engineering is the process of obtaining new features from old features. In simple terms, you can consider one column of a data set as one feature.
- Labels/Output Classes: The final output of an ML system. You can also think of output classes as labels. Labeled data refers to samples that are tagged with one or more labels. Programmers use both features and labels in classification and regression problems.
- Algorithm: The hypothesis set chosen before training starts on real-world data. An algorithm adjusts itself to perform better as it sees more and more data. In other words, it learns through exposure to more data over time, much as humans do.
- Training/Learning: The process of passing data through the algorithm. The algorithm looks for patterns in how features and labels correspond. The training process results in a model.
- Models: A piece of code programmers make better and smarter with the help of training data. It can be a mathematical representation of a real-world process. Analysts can then use the representation to make predictions or inferences, which indicate the most likely outcomes based on relationships between features and labels.
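To make the glossary concrete, here is a minimal sketch in plain Python. It assumes a single feature (house size) and a single label (price), and the numbers are made up for illustration. The algorithm used here, ordinary least squares for a straight line, is just one simple choice among many; the function names `train` and `predict` are our own, not part of any library.

```python
# Illustrative, made-up data: one feature column and one label column.
features = [1000.0, 1500.0, 2000.0, 2500.0]        # house size in square feet
labels = [200000.0, 300000.0, 400000.0, 500000.0]  # sale price


def train(xs, ys):
    """Training: fit price = slope * size + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the trained model to estimate the label for a new feature value."""
    slope, intercept = model
    return slope * x + intercept


# The model is simply what training produced: two learned numbers.
model = train(features, labels)
prediction = predict(model, 1800.0)
print(f"Estimated price for 1,800 sq ft: {prediction:,.0f}")
```

The feature and label lists go in, training looks for the pattern that connects them, and the resulting model is then used to predict a price for a house it has never seen.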
We can see how all of this works together in a simple image:
Machine Learning Planning: Thinking It All Through
As impressive as this all might seem, don’t forget you need human thought to make it all work. Programmers start the process by asking three questions about the problem they want to solve:
- Do I have a well-defined problem to solve?
Not every problem lends itself to ML. The best automation problems are those that involve repeatable, rules-based activities. Problems that you don’t need to repeat, that require quick completion, or that rely on human intuition are generally not good candidates for ML.
- Is ML the best solution for the problem?
As Gartner points out, this question is critical because if data inputs don’t match the defined dataset, the model will fail. Consider comparing delivery route data when one dataset measures distances in miles and the other in kilometers.
- Do I have a way to measure my model’s success?
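Make sure you have access to enough data to train your model and to hold some back for measuring how well it performs. One of the most common reasons ML projects fail is a lack of data. Quantity does not replace quality, however: a large data set with inconsistent or mislabeled records can still produce a model that fails.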
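The miles-versus-kilometers mismatch mentioned under the second question is easy to guard against in code. The sketch below assumes each record states the unit it was measured in. The record layout and the `to_kilometers` helper are hypothetical and exist only for this illustration.

```python
KM_PER_MILE = 1.609344  # exact, by the international definition of the mile

# Hypothetical route records; each one states the unit it was measured in.
routes = [
    {"route": "A", "distance": 10.0, "unit": "mi"},
    {"route": "B", "distance": 12.0, "unit": "km"},
]


def to_kilometers(record):
    """Convert a record's distance to kilometers, rejecting unknown units."""
    if record["unit"] == "km":
        return record["distance"]
    if record["unit"] == "mi":
        return record["distance"] * KM_PER_MILE
    raise ValueError(f"unknown unit: {record['unit']!r}")


# Normalize first, then compare.
distances_km = {r["route"]: to_kilometers(r) for r in routes}
longest = max(distances_km, key=distances_km.get)
print(distances_km, "longest:", longest)
```

Comparing the raw numbers would suggest route B is longer (12 versus 10). After conversion, route A is the longer one, which is the kind of silent error a model would otherwise learn from.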
The Machine Learning Workflow
The diagram below illustrates the ML workflow. As you can see, it is a straightforward process with three phases: sourcing and preparing data; coding the model; and training, evaluating and tuning the model. The last phase begins an iterative process, as the algorithm continues to adjust to the new features it generates.
Source and Prepare Your Data
To succeed, you need a large set of training data that includes the value you want to predict (the label) along with the features you will predict it from. For example, suppose you build your model to predict the sale price of a house. Preparing the data includes:
- Data analysis: Once you source your data, analyze and understand it, and prepare to run it through the training process. In the house example, the features might be location and size of the house, and the label is the sale price.
- Data preprocessing: Transform the valid, clean data into the format that best suits your model’s needs. You would want all your sales prices to be in the same currency type, for example, and size must be a standard measure like square footage, rather than the number of rooms. (Room size and function among houses will vary, but the house’s total square footage will not.)
- Data exploration and preparation completion: Your goal in this straightforward example would be to produce a model that will accurately predict the price of a house as its size and location features vary in relation to each other.
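The preparation steps above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the listings are invented, the field names are our own, the exchange rate is a placeholder rather than a real quote, and turning location into 0/1 columns (one-hot encoding) is one common technique we have chosen, not the only option.

```python
SQFT_PER_SQM = 10.7639  # approximate conversion factor
USD_PER_EUR = 1.10      # hypothetical exchange rate, for illustration only

# Made-up listings with mixed units, mixed currencies and one missing size.
raw_listings = [
    {"location": "north", "size": 1500.0, "size_unit": "sqft",
     "price": 300000.0, "currency": "USD"},
    {"location": "south", "size": 100.0, "size_unit": "sqm",
     "price": 200000.0, "currency": "EUR"},
    {"location": "north", "size": None, "size_unit": "sqft",
     "price": 250000.0, "currency": "USD"},
]


def preprocess(listings):
    """Return (features, labels, locations) in one consistent format."""
    # Keep only valid rows: a listing without a size cannot be used here.
    cleaned = [row for row in listings if row["size"] is not None]
    locations = sorted({row["location"] for row in cleaned})

    features, labels = [], []
    for row in cleaned:
        # Standardize size to square feet and price to US dollars.
        size = row["size"]
        size_sqft = size * SQFT_PER_SQM if row["size_unit"] == "sqm" else size
        price = row["price"]
        price_usd = price * USD_PER_EUR if row["currency"] == "EUR" else price
        # One-hot encode location so it becomes numeric.
        one_hot = [1.0 if row["location"] == loc else 0.0 for loc in locations]
        features.append([size_sqft] + one_hot)
        labels.append(price_usd)
    return features, labels, locations


features, labels, locations = preprocess(raw_listings)
for feature_row, label in zip(features, labels):
    print(feature_row, round(label, 2))
```

Each surviving listing ends up as a row of numbers in the same units, with the sale price kept separately as the label.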
Code Your Model
Develop your model using established ML techniques or by defining new operations and approaches. One of the most common programming languages used for ML is Python. While a deep dive into Python is beyond our scope today, a good first step is to learn the basic mathematical concepts behind ML, like linear algebra.
One good course is Andrew Ng’s Machine Learning on Coursera. In addition to linear algebra basics, it will introduce you to the best machine learning techniques and give you hands-on practice applying them to real-world problems. You’ll then be ready to explore the various Python packages, including NumPy, Matplotlib, pandas and scikit-learn.
But again, while you can learn basic elements of ML and Python online, online training is no substitute for learning through experience, starting with a small, low-risk project.
Train, Evaluate, and Tune Your Model
Now, you are finally ready to train the model with your training data and evaluate how well it performs. Remember, the larger your data set, the better: more data generally helps the model predict more accurately, and it leaves you enough to set aside for validation, further testing and tuning.
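Here is a minimal sketch of the train, evaluate and tune loop, using only the Python standard library. The data is synthetic and generated with a fixed seed. The model is a k-nearest-neighbors average, which we picked only because it is short to write. The setting being tuned, `k`, and the candidate values are likewise assumptions for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic data for illustration only: price is roughly 200 * size plus noise.
sizes = [rng.uniform(800.0, 3000.0) for _ in range(200)]
data = [(size, 200.0 * size + rng.gauss(0.0, 20000.0)) for size in sizes]

# Hold back a quarter of the data to evaluate on houses the model has not seen.
rng.shuffle(data)
train_set, validation_set = data[:150], data[150:]


def knn_predict(train_rows, size, k):
    """Average the prices of the k training houses closest in size."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - size))[:k]
    return sum(price for _, price in nearest) / k


def validation_error(k):
    """Mean absolute error on the held-out validation set."""
    errors = [abs(price - knn_predict(train_set, size, k))
              for size, price in validation_set]
    return sum(errors) / len(errors)


# Tuning: try several values of k and keep the one with the lowest error.
candidate_ks = [1, 3, 5, 10, 20]
errors_by_k = {k: validation_error(k) for k in candidate_ks}
best_k = min(errors_by_k, key=errors_by_k.get)

for k in candidate_ks:
    print(f"k={k:>2}  mean absolute error={errors_by_k[k]:,.0f}")
print("best k:", best_k)
```

The important habit is the split: the model is only ever scored on data it was not trained on, and tuning means repeating that evaluation for each setting you want to compare.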
ML has tremendous potential for improving countless areas of our lives. However, to succeed, you must start by understanding what you want your ML project to accomplish and how you will get it to the finish line. Making sure that ML is the best solution for your problem and that you have the amount of data you need before you start is the best pathway to ML success, no matter what program or technology you use.