All open stadia sporting events are subject to the fickleness of the weather. Each sport has a different provision to address stoppage of play. Cricket – limited overs cricket, to be specific – employs a mathematical formulation called the Duckworth-Lewis (D/L) method [1] to predict the required target for the team playing second. Let us look at how Machine Learning (ML) techniques can be used to predict target scores using statistics from the first side’s play.

**Cricket 101**

If you’re familiar with the rules of cricket, feel free to skip over to the next section. For the rest, here’s a brief explanation about the rules of the game. There are three basic forms of international cricket matches: Test matches, Limited Overs One Day International (ODI) matches and the relatively new Twenty20 (T20) matches. We’ll consider the latter two forms of the game here. Each side consists of eleven players and takes turns in batting and bowling. Each side’s play is called an innings.

A pair of batsmen represents the batting side on the “pitch” or the “wicket”. A batsman faces “balls” or “deliveries” bowled down the pitch by the bowling side while the other batsman of the pair stands at the bowling end of the pitch. The batsmen hit the balls bowled towards them and then run between the ends of the pitch. If the ball crosses the boundary of the ground after one or more bounces, it earns the batsman four runs, whereas if the ball clears the boundary without bouncing it earns six runs. Throughout the innings the batsmen must preserve their “wicket” by not getting dismissed. There are a number of ways a batsman can be dismissed – getting caught out, bowled, stumped, hit wicket and so on – but we will not go into the details as they are not relevant to our discussion. Also, there are a number of ways in which the bowler can bowl an illegal delivery – a no ball or a wide, for example – which results in one or more extra runs being awarded to the batting side as well as an extra delivery to be bowled.

Since there are eleven players, there can be at most ten pairs of batsmen per side. A bowler bowls six legal deliveries to constitute an “over”. The side batting first scores as many runs as possible before losing ten wickets in fifty overs in an ODI (or twenty overs in a T20). The objective of the game is for the side batting second to outscore the first side by one or more runs with one or more wickets in hand before the end of play.

**Target Scores in Curtailed Matches**

Either side can be disadvantaged in case of reduction in play. If the first side played their innings without any stoppage, they would have planned their innings with access to all their resources. The batsmen playing up the order are generally better players who would face the best part of the opposition’s bowling and would therefore likely play conservatively in the earlier stages of the game in order to build the foundation of a long innings. The scoring picks up pace as the premium on preserving wickets diminishes with the progression of overs. It is generally likelier that the earlier part of the innings is slower relative to the latter part of the innings.

Conversely, if a side bats aggressively earlier on, it has a higher probability of losing wickets sooner and consequently failing to maintain the initial momentum at the end of the innings. If play is curtailed for the chasing side before it commences its innings, it starts with the advantage of more wickets in hand and a lower total to score in fewer overs. It has more wickets on average per over at the start of play and a lower premium on the preservation of wickets, so it is likely to play aggressively and disadvantage the first side. Alternatively, if the chasing side starts with the assumption that it will play a complete innings and the innings is then curtailed, it would have faced an unfair share of the better part of the opposition’s bowling and of the fielding restrictions. Its chasing strategy would also have been based on the assumption of a complete innings.

**Duckworth-Lewis Method and Limitations**

The D/L method is widely used in limited overs international matches to predict the target score. It is a statistical formula to set a fair target for the second team’s innings based on the score achieved by the first team. It takes into account the chasing side’s wickets lost and overs remaining. The predicted par score is calculated at each ball and is proportional to a percentage of the combination of wickets in hand and overs remaining. It also scales the score against 225, which is an average of 50-over scores in England and Wales Cricket Board (ECB) matches and ODIs [2]. While this number remains constant in calculating the par score by the D/L method, the average score in ODIs has increased throughout the past decade due to shifts in strategy, changes in rules and equipment, the influence of T20s et cetera.
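For reference, the published Standard Edition of D/L reduces to a short calculation once the resource percentages are known. The sketch below assumes the two resource percentages (`r1` and `r2`) have already been looked up in the official D/L table, which is not reproduced here. The function name and the sample figures are made up for illustration.

```python
import math

G50 = 225  # average 50-over score used for scaling, as cited above


def dl_target(first_innings_score, r1, r2, g50=G50):
    """Return the revised target for the side batting second.

    r1 and r2 are the resource percentages (0-100) available to the first
    and second sides, assumed to come from the official D/L table.
    """
    if r2 <= r1:
        # The chasing side has fewer resources: scale the score down.
        par = first_innings_score * r2 / r1
    else:
        # The chasing side has more resources: add runs in proportion to G50.
        par = first_innings_score + g50 * (r2 - r1) / 100
    # The target is one more than the par score rounded down.
    return math.floor(par) + 1


# Hypothetical figures: 250 scored with 100% resources, chase with 80%.
print(dl_target(250, 100.0, 80.0))  # 201
```

Note that nothing but the first side's total and the two resource percentages enters the calculation, which is the limitation discussed in the rest of this section.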

Besides the wickets and overs remaining, there are other factors which affect the score. If there is a succession of wickets falling in the first half of the innings, the number of runs scored per delivery faced, or the average run rate, is likely to fall as the batting side tries to consolidate its innings. However, towards the end of the innings the run rate may increase in spite of falling wickets, rather than decrease or plateau, as the premium on wickets is lower at that stage. The batting side might also play more aggressively as the resources of the bowling side deplete. Also, the D/L method does not account for changes in the proportion of the innings for which field restrictions are in place compared to a completed innings.

Losing two or more wickets in quick succession from the top of the order is more detrimental to the innings than successive wickets from the tail end. Partnerships are also a very important factor in the final score. If two batsmen are well set in a large partnership and a wicket is lost, the final score depends on the ability of the new batsman to maintain or accelerate the run rate established by the previous batting pair. Apart from the available total deliveries, each side may face more than its allocated share of deliveries in the form of extras. One side may capitalize on this better than the other side and score more runs off the extra deliveries.

The D/L method places a larger emphasis on wickets than on overs without considering any of the other factors like effective run rate, partnerships and extras at different stages of the innings. One result is that sides batting first can game the system at the prospect of rain by playing defensively and at a slower rate, knowing that, all other things being equal, the chasing target will be higher if more wickets are preserved. Equally, the side batting second may be caught out playing defensively earlier on, expecting a complete innings, and then have to chase a larger target if the innings is stopped. Given all these factors, it is very difficult to predict the chasing target from only the wickets left and overs remaining.

**Predicting Target Scores Using Linear Regression**

Reducing the objective of the chasing side to a purely mathematical perspective, we can say that it needs to play exactly like the first side except that it must be infinitesimally ahead at the close of play. Let’s say we have a multidimensional equation representing how the first side played, with each dimension representing a different factor which contributed to the final score.

a₁x₁ + a₂x₂ + a₃x₃ + … + aₙxₙ = y

At each point in the second innings’ play, we can use this equation to predict what the first side would have scored had it played exactly like the second side. Note that the coefficients can have both negative and positive values. If the first side played well, each of the *aᵢ* coefficients will have a higher value, so the predicted score from this equation for the second side will be higher than its actual score and it will be behind the target. If the chasing side plays marginally better, the predicted score will be slightly higher, but if it plays much better then it will eventually *surpass* the coefficient values and the predicted target to chase will be lower than its actual score. If the first innings was unstable or very defensive, with a high number of wickets left but a lower total, the calculated coefficients will be lower and it would take less effort for the chasing side to surpass these values. This equation ensures that the chasing side will have to play better than the first side irrespective of the strategy or the total of the first innings.
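To make the comparison concrete, here is a minimal sketch of evaluating the equation for the chasing side’s state at one point in its innings. The coefficient and feature values are invented for illustration and are not taken from any match.

```python
def predicted_score(coefficients, features):
    """Evaluate a1*x1 + a2*x2 + ... for one innings state."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    return sum(a * x for a, x in zip(coefficients, features))


def chase_status(coefficients, features, actual_score):
    """Compare the chasing side's actual score with the predicted target."""
    target = predicted_score(coefficients, features)
    return "ahead" if actual_score > target else "behind"


# Invented coefficients and feature values, for illustration only.
a = [0.5, 2.0, -1.0]
x = [120, 8, 4]  # e.g. deliveries bowled, wickets left, extras
print(predicted_score(a, x))   # 72.0
print(chase_status(a, x, 80))  # ahead
```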

We use the ML technique of Normalized Linear Regression (NLR) with Regularization to predict a vector of theta values, i.e. the coefficients of this multidimensional equation. To obtain the equation, we use the following data from the first innings as the features (xᵢ): the number of deliveries bowled (including the number of extra deliveries), wickets left, extras, effective runs scored per delivery faced (net run rate including extra deliveries) and partnership since the last wicket. These are compared against the total runs scored (y). Since the values are on different scales, the mean is calculated for each column and the data is normalized. The data is run through a range of regularization lambda values to find the point of least error in the NLR cost function. Once this equation is obtained, predicting the target score is just a matter of plugging the available stats from the second innings into this equation.
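The fitting step described above might look roughly like the following NumPy sketch. It is not the code from the repository. Several details are assumptions on my part: scaling by the standard deviation as well as the mean, adding an unpenalized intercept term, solving in closed form, and choosing lambda on a held-out slice of rows. The data is synthetic.

```python
import numpy as np


def normalize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mu) / sigma, mu, sigma


def fit_ridge(X, y, lam):
    """Closed-form regularized least squares; the intercept is not penalized."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(theta, X):
    """Apply the fitted coefficients to already-normalized rows."""
    return np.hstack([np.ones((X.shape[0], 1)), X]) @ theta


def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Return (lambda, theta) with the lowest mean squared validation error."""
    best = None
    for lam in lambdas:
        theta = fit_ridge(X_train, y_train, lam)
        err = np.mean((predict(theta, X_val) - y_val) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, theta)
    return best[1], best[2]


# Synthetic stand-in for first-innings rows (not real match data).
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 300, size=(200, 5))
y = X_raw @ np.array([0.8, -2.0, 1.0, 5.0, 0.3]) + rng.normal(0, 1, 200)

X, mu, sigma = normalize(X_raw)
lam, theta = select_lambda(X[:150], y[:150], X[150:], y[150:],
                           lambdas=[0, 0.01, 0.1, 1, 10])

# Second-innings rows must be scaled with the first innings' mu and sigma.
second_innings_state = (X_raw[:1] - mu) / sigma
print(lam, predict(theta, second_innings_state))
```

The point to carry over is the last step: the chasing side’s statistics are normalized with the first innings’ mean and scale, not their own, so that both innings are measured against the same equation.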

The number of wickets left and partnership scores are considered in a way that penalizes wickets at the top of the order at the beginning of the innings more than those at the bottom of the order or towards the end of the innings. Only the first innings’ data for the current match is used to calculate the LR equation, as opposed to other ML techniques which use all prior data available. The reasoning is that different matches, even those between the same sides, are considered mutually independent events. There are various other unquantifiable factors like the location, state of the pitch, changing temperatures, dew and humidity, composition of the side and so on that cannot be used uniformly in any mathematical formulation. Also, on a particular day any side can outperform or under-perform, so it would be unfair to consider prior data unrelated to the present match.
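As an illustration of how the first-innings rows could be assembled from ball-by-ball data, here is a small sketch. The `Delivery` layout is hypothetical and is not Cricsheet’s schema. Taking y as the cumulative runs at each delivery is my assumption, and the extra weighting of top-order wickets mentioned above is not implemented here.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    """One ball of an innings; a hypothetical layout, not Cricsheet's schema."""
    runs_off_bat: int
    extras: int = 0
    wicket: bool = False


def build_rows(deliveries, total_wickets=10):
    """Return (features, runs) with one row per delivery bowled.

    Feature order: deliveries bowled (extra deliveries included), wickets
    left, extras so far, runs per delivery bowled, partnership since the
    last wicket.
    """
    features, runs = [], []
    total = extras = wickets_lost = partnership = 0
    for bowled, d in enumerate(deliveries, start=1):
        scored = d.runs_off_bat + d.extras
        total += scored
        extras += d.extras
        partnership += scored
        if d.wicket:
            wickets_lost += 1
            partnership = 0  # a new pair starts from zero
        features.append([bowled, total_wickets - wickets_lost, extras,
                         total / bowled, partnership])
        runs.append(total)
    return features, runs


# Four made-up deliveries: a single, a wide, a four, then a wicket.
innings = [Delivery(1), Delivery(0, extras=1), Delivery(4),
           Delivery(0, wicket=True)]
X, y = build_rows(innings)
print(X[-1])  # [4, 9, 1, 1.5, 0]
print(y)      # [1, 2, 6, 6]
```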

**Test Results**

This method was run on a number of ODI matches – complete and curtailed – and some of the results have been illustrated below. A large percentage of the match scores were tested from the 501 limited overs match data available. Out of this set, all 51 matches featuring the D/L method were tested. The vast majority of test results on ODI data were very good. There was a small subset – eight matches – for which the predicted targets were skewed through the innings. The test results for T20 matches were inconclusive and we shall not consider using this method for the T20 format for now.

Note that the initially predicted target score may not be in the range of that obtained through the D/L method, but the target varies after each delivery bowled. If the chasing side is scoring at a much faster rate and losing wickets infrequently, the loss of wickets will not raise the predicted target score by a large amount. Alternatively, if the effective run rate is increasing very slowly or decreasing without a significant partnership, then the fall of a wicket is penalized more heavily. Though unintuitive, it is quite possible, and it was seen in some instances, that the fall of a wicket of the chasing side actually decreased the target score. This is because the equation predicts how the first side would have reacted had it lost a wicket when playing with the statistics of the chasing side.

###### Unanticipated Reduced Innings: NatWest Series 2011 [3]

Here is the simplest case: the first side plays a complete innings but the chasing side is given a revised target with reduced overs to chase. India played their complete allotment of overs and scored 304 runs for the loss of six wickets. England were set a target of 241 from 34 overs. Applying the ML NLR approach, England would have been given an initial target of 193 instead. We see from the graph below that England were ahead of the predicted target up until the point they lost their first wicket at 60. Thereafter the target increases, and the slowing run rate between the 10th and 20th overs puts the target well over the actual score. At the end of play England were at 241 for four wickets, but with the NLR equation the predicted winning target would have been 286 instead.

###### Anticipated Reduced Innings: West Indies Tri-Nation Series 2013 [4]

This is an interesting one. India were batting first against Sri Lanka and were expecting the innings to be cut short. Knowing that the D/L system favors saving wickets in the first innings, they played defensively and set up a total of 119 runs with seven wickets left in 29 overs. Sri Lanka were given a target of 178 runs from 26 overs. According to the ML NLR approach, Sri Lanka would have been given an initial target of 112 from 26 overs. If their innings had progressed exactly as it did, we see that the revised target was manageable until much further into the innings. However, towards the end of the innings the run rate didn’t increase enough with wickets falling regularly, so the target score escalated quickly and was beyond reach at the end of play.

###### Both Innings Reduced: Champions Trophy Final 2013 [5]

A particularly telling result was the match comparison of the Champions Trophy 2013 final between India and England. Note that both sides were allotted twenty overs each and the D/L method was not used in this match. However, the progression of the match is very interesting so we’ll look at it here. If we were to use the ML NLR method to calculate a target score for the second innings, it would be set at 125 runs from 20 overs.

Looking at the England side’s data, initially the target score increases slowly as there are wickets in hand with stable partnerships in the middle of the innings and a high net run rate at that point. Slowly the actual score surpasses the target score in this middle phase. As soon as wickets start falling in quick succession and the run rate reduces, the predicted target score increases and eventually surpasses the actual score.

###### Unstable Chase in Second Innings: Bangladesh in Sri Lanka ODI Series [6]

Let’s look at an extreme example where the first innings is complete and has a high total built solidly but the chasing side has an unbalanced and reduced innings. Sri Lanka played a complete first innings to put up 302 runs with one wicket remaining in 50 overs; Bangladesh were set a target of 184 runs from 27 overs with the D/L method. Bangladesh chased down the score in 26 overs with three wickets in hand. With the ML NLR technique, however, Bangladesh would have been set an initial target of 150 runs from 27 overs, but we see here that the chasing run rate increases very rapidly. Comparing the two innings, Sri Lanka lost their first two wickets at 116 and 203 runs in the 22nd and 36th overs respectively, whereas Bangladesh lost wickets at fairly regular intervals after the first wicket on 77 and had only three wickets remaining at the end. The predicted target score at the end would have been 289, which is a rather extreme interpretation of the equation; more about this later.

**Conclusion**

The ML technique of Normalized Linear Regression with Regularization can be used to calculate an equation representing the first innings’ play. The target score for the chasing side is the predicted score that the first side would have made had it played exactly like the second side. If the first side plays defensively, the second side can play a similar strategy and not be penalized for it. Similarly, if the first team played a complete innings solidly – aggressively and defensively at different times without destabilizing the innings at any point – the second team has to play similarly or better else it is penalized with a higher target.

Recent advances in computation and Machine Learning algorithms make it easy to process and scale large quantities of data very efficiently and obtain a relationship between multiple features. This solution demonstrates that a number of features quantifying the state of an innings in a cricket match can be used to more accurately predict the target score. This is a proof of concept and is not optimal. It was seen from the final example above that if there is a much higher importance on the wickets and partnerships then there is a possibility of seeing a skewed target on some occasions. There may be other features which can be quantified and added. Similarly, existing features can be modified or removed entirely so that the LR implementation obtains better and more consistent results.

## Credits

Andrew Ng’s Machine Learning course was tremendously informative and set me on the path to create this system. Check out the course and others on the knowledge trove Coursera. [7]

Obtaining any meaningful data comprises the largest part of any Machine Learning problem. All cricket data was obtained from Cricsheet. [8]

## Code and Errors

All code is accessible on my GitHub account. [9] As stated earlier, the purpose of this demonstration is the use of Machine Learning to better predict targets of cricket matches considering a number of factors in addition to the overs and wickets remaining.

The efficiency and accuracy of the scripts and programs can be improved. The list of match numbers where the test results were skewed is as follows: 352668, 366624, 474470, 489222, 560923, 578623, 582188 and 602476.

**Links and Resources**

[1] http://en.wikipedia.org/wiki/Duckworth%E2%80%93Lewis_method

[2] http://static.espncricinfo.com/db/ABOUT_CRICKET/RAIN_RULES/DUCKWORTH_LEWIS.html

[3] http://www.espncricinfo.com/england-v-india-2011/engine/match/474481.html

[4] http://www.espncricinfo.com/tri-nation-west-indies-2013/engine/match/597928.html

[5] http://www.espncricinfo.com/icc-champions-trophy-2013/engine/match/566948.html

[6] http://www.espncricinfo.com/sri-lanka-v-bangladesh-2013/engine/current/match/602476.html

[7] https://www.coursera.org/#course/ml

[9] https://github.com/nileshkaria/MLCricketScorePredictor

© All rights reserved by Nilesh Karia and neosaurus.wordpress.com, 2013. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Nilesh Karia and neosaurus.wordpress.com with appropriate and specific direction to the original content.

Disclaimer: I know nothing about cricket.

Your features (variables) are “the number of deliveries bowled (including the number of extra deliveries), wickets left, extras, effective runs scored per deliveries faced (net run rate including extra deliveries), partnership since last wicket and total runs scored”.

Is it possible that your predictor variables suffer from multicollinearity? If so you may need to perform principal component analysis.

The parameters were selected such that they didn’t suffer from multicollinearity. The number of runs scored is the y value in the equation.

The post wasn’t clear on that and it has now been updated. Thanks for pointing it out!