5
Identifying What Matters Most – Intuition Behind Principal Components, Factors, and Optimization

AI scientists tried to program computers to act like humans without first answering what intelligence is and what it means to understand. They left out the most important part of building intelligent machines, the intelligence! ‘Real intelligence’ makes the point that before we attempt to build intelligent machines, we have to first understand how the brain thinks, and there is nothing artificial about that. Only then can we ask how we can build intelligent machines. 

– Jeff Hawkins, On Intelligence (Times Books, 2007)

Principal Component Analysis and Its Applications

Even without Machine Learning, there are some very powerful tools to identify relationships between things, if those relationships are “linear.” Linear means what you learned in high school, such as if 10x – 7y = z, then z depends linearly on the variables x and y because a change in one of them gives a proportional change in z.

In algebra, you get to solve the equation – given a couple of points (x1, y1, z1) and (x2, y2, z2) you can solve the equations

ax1 + by1 = z1

ax2 + by2 = z2

for a and b and you’re done!

Unfortunately, real life is quite a bit messier than high school algebra. In real life, you don’t have exact relationships, you have approximate relationships, and the data is fuzzy. Even if things are related linearly, there is still a lot of “noise.”

But we can still find the “best” approximate linear relationship, in a few different ways.

The simplest case is where you have a bunch of independent variables like “x” and “y” and a dependent variable like “z.” In that case, you have the following silly scheme:

  1. guess a, b, and c, which are supposed to give a prediction for z of the form ax + by + c = z
  2. calculate the “error” e for each data point (x, y, z), meaning how much the prediction is off by:

    e1 = z1 – (ax1 + by1 + c)

    e2 = z2 – (ax2 + by2 + c)

    e3 = z3 – (ax3 + by3 + c)

    etc. until you’ve done all of the points. How many points, kids? “n” of them! You learn fast.

  3. If the total error is too much, guess again.

Of course, we can do better using calculus. There is a method called least-squares regression which, given all the data points, “automagically” finds the best linear model (that is, the best choices for a, b, and c in the above example), where “best” means the smallest possible value for the sum of the squares of the errors (e1^2 + e2^2 + . . . + en^2).
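
If you want to see both steps in action, here is a minimal sketch in Python (using NumPy, with made-up data standing in for real measurements): it computes the errors for a guessed a, b, and c, and then lets least squares find the best values automatically.

import numpy as np

# Made-up data: n points of (x, y, z), where z is roughly 2x - 3y + 5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = rng.uniform(0, 10, size=50)
z = 2 * x - 3 * y + 5 + rng.normal(0, 0.5, size=50)

# Steps 1-2 of the "silly scheme": guess a, b, c and compute the error at each point.
a, b, c = 1.0, -1.0, 0.0                   # a deliberately bad guess
errors = z - (a * x + b * y + c)
print("sum of squared errors for the guess:", np.sum(errors ** 2))

# Least squares: find the a, b, c that minimize the sum of squared errors.
# Each row of the design matrix is [x_i, y_i, 1]; the column of 1s handles c.
A = np.column_stack([x, y, np.ones_like(x)])
(a_best, b_best, c_best), *_ = np.linalg.lstsq(A, z, rcond=None)
print("best fit:", a_best, b_best, c_best)  # close to 2, -3, 5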

Why do we use squares rather than the total error given by adding up the errors? Well, if some are positive and some are negative, they could add up to zero. Squaring is a way to make them all positive, so the math works out nicely. But it also makes the judgment that big errors are relatively more of a problem than small errors, which means that an outlier (a far-off data point, like if someone lied or made a typo or was a mutant) can ruin your whole day.

That’s okay, because there are other methods that minimize the error in different ways. So we can pick one we like, and it works just as well for more independent variables (inputs) and more dependent variables (outputs).

The tools here come from a branch of math called Linear Algebra and the main gadget in this field is called a matrix. A matrix is just a box of numbers summarizing some linear relationships. For example, we can represent the system of three equations

a11x1 + a12x2 + a13x3 = y1

a21x1 + a22x2 + a23x3 = y2

a31x1 + a32x2 + a33x3 = y3

by the single “matrix equation” AX = Y,

where A stands for the 3×3 box

a11 a12 a13

a21 a22 a23

a31 a32 a33

and X is the 3×1 vector

x1

x2

x3

and Y is the 3×1 vector

y1

y2

y3

(Note: there are some tricky things about matrix “equations.” The dimensions have to match up in a certain way, and multiplication isn’t commutative – sometimes AB and BA are different!)
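
For the curious, here is a tiny sketch in Python (using NumPy, with made-up numbers) that writes a 3×3 system as the matrix equation AX = Y, solves it, and checks that AB and BA really can differ.

import numpy as np

# A is the 3x3 box of coefficients, Y is the right-hand side (made-up numbers).
A = np.array([[2.0, 1.0, -1.0],
              [1.0, 3.0,  2.0],
              [3.0, 0.0,  1.0]])
Y = np.array([3.0, 13.0, 4.0])

# Solve AX = Y for the vector X = (x1, x2, x3).
X = np.linalg.solve(A, Y)
print(X)

# Non-commutativity: A @ B and B @ A are generally different matrices.
B = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(np.allclose(A @ B, B @ A))  # False for these matrices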

In addition to solving equations, which was their original use, matrices (that’s the plural, don’t ask why) can be used to represent transformations of data and movements in space, even space with a lot more dimensions than we can see or even imagine (cue Han Solo voice: I can imagine an awful lot . . .). Some of the more advanced tools in linear algebra solve systems of inequalities as well as systems of equations – this is called “linear programming.” For example, buying ingredients at the supermarket so that you meet all your nutritional requirements at the least cost is such a problem. Solving it involves using matrices to define multidimensional “hyperplanes” in a multidimensional “solution space,” then finding the point in the region bounded by those hyperplanes where some linear function has the best value, usually by starting at a corner and traveling along the edges in a manner similar to the “gradient descent” method we discussed in Chapters 3 and 4.
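
Here is a sketch of what the supermarket problem looks like in code, using SciPy’s linprog routine with invented prices and nutrition numbers (not a real grocery list):

import numpy as np
from scipy.optimize import linprog

# Invented example: 3 foods, 2 nutrients. Minimize cost subject to getting
# at least the required amount of each nutrient.
cost = np.array([2.0, 3.5, 1.5])           # price per unit of each food

# nutrient[i, j] = amount of nutrient i in one unit of food j
nutrient = np.array([[3.0, 2.0, 1.0],      # e.g. protein
                     [1.0, 4.0, 2.0]])     # e.g. fiber
required = np.array([10.0, 8.0])           # minimum daily requirements

# linprog minimizes cost @ x subject to A_ub @ x <= b_ub, so we flip the signs
# to turn "nutrients >= required" into a <= constraint.
result = linprog(c=cost,
                 A_ub=-nutrient, b_ub=-required,
                 bounds=[(0, None)] * 3,   # can't buy negative food
                 method="highs")
print(result.x, result.fun)                # amounts to buy, total cost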

Still, we’re missing something. What if the z doesn’t really depend on the variables directly, or if they’re not really independent? Can we understand things better by being smarter?

Yes. That’s what Principal Components Analysis (PCA) does. Given a bunch of data, it doesn’t just find the best numbers to multiply some of the variables by to get the rest of the variables (that’s what “regression” does). Instead, it finds better variables. It looks for linear combinations of the existing variables that are more informative.

Suppose you have a bunch of statistics about baseball players (forget pitchers because they’re a different breed of cat): batting average, slugging average, extra base hits, walks, strikeouts, RBIs, homers, stolen bases, runs scored, errors, putouts, fielding chances, total at-bats, and so on. Those things are redundant and related to each other, but some of them are more useful than others. If you put them all together and did a Principal Components Analysis, you’d find a single combination of them that explained the most, which would sort of be considered “how good the player is.” It would be negatively correlated with errors and positively correlated with everything else.

After you allowed for the redundancies, you would find that there were other components that described the player in other ways than “how good he is.” For instance, some players have a lot of runs and stolen bases, while others have lots of homers and RBIs.

The Principal Components Analysis will find the combination to explain as much as possible of the variation that the first combination couldn’t explain – you might call that “hitter type” and it would be an axis with sluggers at one end and speedsters at the other, depending on the weights it gave to the different stats.

Mike Trout (the best baseball player today, in case you didn’t know) would be good at everything, so he would score very high on the first axis. But the second axis is by construction uncorrelated with the first, so it measures something different – not exactly “hitter type” but more “hitter type relative to value,” because sluggers are worth more than speedsters, and that part of their value already got captured by the first component.

In terms of linear algebra, you find the “covariance matrix” between pairs of variables, and then the “eigenvectors” and “eigenvalues” of that matrix. The eigenvector for the biggest eigenvalue is the first principal component, and so on. What is an eigenvector? It’s a nonzero solution to the matrix equation AX = vX, where a vector X is, when transformed by the matrix A, simply turned into a constant “scalar” multiple of itself. The scalar quantity v is the eigenvalue.
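
That recipe is short enough to write out; here is a minimal sketch in Python (NumPy, with a made-up data matrix standing in for the table of player stats):

import numpy as np

# Made-up data: rows are players, columns are stats (already numeric).
rng = np.random.default_rng(1)
stats = rng.normal(size=(200, 6))
stats[:, 1] = stats[:, 0] * 0.8 + rng.normal(scale=0.3, size=200)  # make two stats redundant

# 1. Center the data and form the covariance matrix between pairs of variables.
centered = stats - stats.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# 2. Eigenvectors and eigenvalues of the covariance matrix (eigh: it is symmetric).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort from biggest eigenvalue to smallest; the first column is now the
#    first principal component - the single most informative combination.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

first_pc = eigenvectors[:, 0]             # weights for "how good the player is"
scores = centered @ eigenvectors[:, :2]   # each player's position on the first two axes
print(eigenvalues[:2], first_pc)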

But the baseball stat geeks, starting with Bill James, did an even more sophisticated kind of analysis called “factor analysis,” which is a lot like PCA except, in Machine Learning terms, it was supervised rather than unsupervised. The standard stats are visible, but they theorized that underlying them are some invisible factors that really represent a player’s true value more accurately. Instead of straight combinations of the stats you have, they are qualities more like strength, speed, coordination, and maybe something that could go by the general term of “baseball smarts,” or decision-making.

The sluggers would probably have a lot of strength, the base-stealers would have a lot of speed, the players with high batting averages might have good coordination, and so on. The difference is that the person doing the modeling gets to make choices that influence how the variables are combined, rather than getting automatically constructed combinations each of which is explicitly built to be independent of the previous ones. The factors here won’t be independent. Strength and speed are both positively influenced by general fitness or athleticism, for example.

The math for both of these is pretty similar, and involves the same kind of linear algebra as regression analysis, but done in stages. Linear stuff is relatively easy to model.
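
If you want to try factor analysis yourself, scikit-learn ships one common implementation; here is a minimal sketch with invented “stats” generated from two hidden factors, so you can see the loadings it recovers (the column names and numbers are made up for illustration):

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Made-up stand-in for a table of player stats (rows = players, columns = stats).
rng = np.random.default_rng(2)
strength = rng.normal(size=300)                   # hidden factors we pretend not to see
speed = 0.4 * strength + rng.normal(size=300)     # correlated with strength via "athleticism"
stats = np.column_stack([
    3.0 * strength + rng.normal(scale=0.5, size=300),  # homers
    2.5 * strength + rng.normal(scale=0.5, size=300),  # RBIs
    2.0 * speed + rng.normal(scale=0.5, size=300),     # stolen bases
    1.5 * speed + rng.normal(scale=0.5, size=300),     # runs scored
])

# Ask for two latent factors; the loadings say how each visible stat is
# built out of the hidden factors the model inferred.
fa = FactorAnalysis(n_components=2, random_state=0)
hidden = fa.fit_transform(stats)   # each player's estimated factor scores
print(fa.components_)              # 2 x 4 matrix of loadings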

If you are a baseball fan, you may have seen the movie Moneyball, about how Bill James and the other math nerds revolutionized the way baseball used statistics, which led to significant changes in how the game was played. They used tools like the ones described here to figure out (much better than “conventional wisdom” had done) what attributes of players and what strategies of play contributed the most to success on the field. The same thing was then done for other sports by imitators, but nothing compares to baseball for the sheer overwhelming volume of the statistics generated – if you’re a Big Data guy, that’s the sport for you.

For a preview of how complicated it can get, see here:

https://techofcomm.wordpress.com/2015/10/14/what-sabermetrics-and-baseball-analytics-want

Concerned Parent: “If all your friends jumped off a bridge, would you follow them?”

Machine Learning Algorithm: “Yes.”

– @computerfact

Intuition Behind Rule-based and Fuzzy Inference Engines

All men are mortal.

Socrates is a man.

Therefore, Socrates is mortal.

That’s the classic example of a syllogism, which is a simple kind of logic. Logic basically means correct verbal reasoning.

For example, in logic we use ordinary words like “and,” “or,” “not,” “implies,” and “equivalent” with very precise meanings, to combine propositions (which are statements that can be true or false) into other ones. The method of “truth tables” defines this part of logic:

P   Q   not P   P or Q   P and Q   P implies Q   P iff Q
T   T     F       T         T           T           T
T   F     F       T         F           F           F
F   T     T       T         F           T           F
F   F     T       F         F           T           T

Here T is “true,” F is “false,” and the only hard thing is understanding “implies.” Let P = “It is raining,” Q = “I will take my umbrella when I go out.” Then “P implies Q” is “If it is raining, then I will take my umbrella when I go out.” But what if it isn’t raining and I take my umbrella anyway? Well, I didn’t say I wouldn’t! Maybe it’s stylish or I want to be safe if it rains later or something, but the fact that I took the umbrella when it wasn’t raining DOES NOT CONTRADICT the statement I labeled “P implies Q,” so we count it as satisfied and true if P doesn’t actually happen.

Of course, it gets complicated quickly. With four letters P, Q, R, and S representing propositions, we have 16 different combinations of True and False (TTTT TTTF TTFT TTFF TFTT TFTF TFFT TFFF FTTT FTTF FTFT FTFF FFTT FFTF FFFT FFFF). If we wanted to check whether something like “(P and Q) implies R, or not-R and S, or not-S and P” is always true, we would have to work through all 16 cases. But maybe we can use shortcuts: find some template sentences that we know are always true and string them together to make a proof rather than constructing a truth table. For example, “((P implies Q) and (Q implies R)) implies (P implies R)” is always true no matter what the truth values of P, Q, and R are (propositions like this are called “tautologies”), and now that we know “implication is transitive” we can shorten a lot of other proofs and possibly avoid the exponential blowup of truth tables. (If you remember about P and NP, it is not considered likely that you can ALWAYS get short proofs this way, because that would mean P equals NP, but you often can.)
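
A brute-force truth-table check is easy enough to write; here is a small sketch in Python that confirms the transitivity tautology by trying every combination of truth values (all 2^n of them – exactly the exponential blowup mentioned above):

from itertools import product

def implies(p, q):
    # "P implies Q" is only false when P is true and Q is false.
    return (not p) or q

def is_tautology(formula, num_vars):
    # Try every combination of True/False for the variables (2**num_vars rows).
    return all(formula(*values) for values in product([True, False], repeat=num_vars))

# ((P implies Q) and (Q implies R)) implies (P implies R)
transitivity = lambda p, q, r: implies(implies(p, q) and implies(q, r), implies(p, r))
print(is_tautology(transitivity, 3))        # True: it holds in all 8 cases

# Something that is NOT a tautology, for contrast: "P or Q"
print(is_tautology(lambda p, q: p or q, 2)) # False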

Back in the old days, before we understood that the hard stuff is easy and the easy stuff is hard, there was a kind of AI program called an “expert system.”

Expert systems assumed that the performance of human experts could be simulated by figuring out which logical rules they were using: in other words, what terms they used, how those terms were defined, and what specific core assumptions applied to the domains they were experts in.

In a few areas, this actually worked okay. But they were the boring areas. We won’t summarize logic here, but basically the systems were expected to work in a domain that contained “objects,” and relations between those objects, and postulates, which were known or assumed to be true, and inference rules, which did what logic does: take you from true things you know to more true things that you weren’t aware of.

(Technically, if they were consequences of things you knew already, you sort of knew them too. That’s what Socrates was so annoyingly showing all the time, until the Athenians got so fed up they decided to demonstrate that the conclusion of the introductory syllogism was true without needing the first two assumptions.)

Example: The patient has symptom A. Possible causes of A are W, X, Y, and Z. Test B rules out W and X and is consistent with Y and Z. Test C rules out W and is consistent with X, Y, and Z. Therefore, perform test B because it tells you more than test C about the current patient (Test C, unlike Test B, can help distinguish between two other conditions, S and T, but we don’t care about those because they don’t produce symptom A). If Test B rules out W and X, then the patient has Y or Z, and the following steps are indicated . . .
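
A toy version of that reasoning fits in a few lines of Python; here is a sketch, with the tests and causes invented to match the example above:

# Invented knowledge base: which possible causes of symptom A each test is
# consistent with. A test "rules out" whatever is not in its set.
possible_causes = {"W", "X", "Y", "Z"}
consistent_with = {
    "Test B": {"Y", "Z"},
    "Test C": {"X", "Y", "Z"},
}

def remaining_after(test, hypotheses):
    # Inference rule: keep only the hypotheses the test result is consistent with.
    return hypotheses & consistent_with[test]

# Choose the test that leaves the smallest hypothesis set
# (a crude stand-in for "tells you the most").
best_test = min(consistent_with, key=lambda t: len(remaining_after(t, possible_causes)))
print(best_test)                                   # Test B
print(remaining_after(best_test, possible_causes)) # {'Y', 'Z'}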

Unfortunately, it turned out that human experts had a lot of intuitive types of thinking going on that they liked to reduce to rules after the fact to rationalize what worked, and the expert systems didn’t always capture them, so they had trouble when moving to more complex domains. You can get a robot to navigate a maze by rule-based logic, but plop it down in the middle of a crowded city plaza and it won’t know how to navigate using simple deterministic rules. It doesn’t know what it doesn’t know and those “unknown unknowns” can kill it.

If you don’t include something in your model, that doesn’t mean it doesn’t exist. Nassim Taleb popularized the idea of a “black swan event” – something important but rare that you had no idea was even possible because your model didn’t include it, and you weren’t around the last time it happened, if it ever did.

This is not to disparage the importance of rule-based logic. What it does, it does very well, and if you want to establish a mathematical theorem or convict someone in court, you can’t do without it. Machine Learning systems will ultimately need it if they are going to survive being fed a diet of internet comments from people who commit every fallacy in the book.

The next step was “fuzzy logic.” This was an attempt to deal with non-deterministic situations where you either don’t have all the information you need, or the future has some randomness to it.

Robot poker players became good because they were designed to assign to statements “truth values,” which were probabilities between 0 and 1 (0 always false, 1 always true), so for example you could have a rule “if the other player opens with a maximum raise, and you have a middle-value hand, the probability he has a better hand than you increases from 0.5 to 0.7, and the probability that he has a bad hand and is bluffing is 0.2,” or something like that.

Fuzzy logic can deal with probabilistic situations while applying rules consistently; the effectiveness of Inference Engines (which are machines that generate statements believed to be true, based on logic applied to previous statements and data) ultimately depends on Bayes’ theorem, which gives the correct way to update estimated probabilities based on evidence. But it is still easy to get things wrong: probability is tricky! There have been a number of cases where juries convicted innocent defendants, or acquitted guilty ones, because a smart lawyer tricked them with statistics and the judge wasn’t smart enough to notice. A famous example: A man was murdered in California and witnesses described the couple responsible – the man and the woman were of certain ages and had certain combinations of hair, skin, and eye color, and they drove a particular make and model of car. Detectives scoured the records until they found such a couple, and prosecutors explained that the chance was 1 in a million that a couple would satisfy all of those criteria, therefore they were guilty beyond a reasonable doubt!

But of course there were 10 million couples in the state of California at the time, so there were probably about 10 couples that fit the description – therefore the chance that the first couple the police found fitting the description was guilty was much closer to 1 out of 10 than to 999,999 out of 1 million! They weren’t a random couple; they were a random couple fitting the description police were looking for.
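
The arithmetic behind that correction is easy to check; here is a sketch using the round numbers from the story:

# Round numbers from the story above.
couples_in_state = 10_000_000
p_match = 1 / 1_000_000        # chance a random couple fits the whole description

# Expected number of couples in the state fitting the description.
expected_matches = couples_in_state * p_match
print(expected_matches)        # about 10

# If exactly one of those matching couples is guilty and the police simply
# found *a* matching couple, the chance they found the right one is roughly:
p_guilty_given_match = 1 / expected_matches
print(p_guilty_given_match)    # about 0.1, not 999,999 out of 1 million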

Progress has been made with rigorous probabilistic reasoning, but it hasn’t turned out to be the most promising method for AI.

When asked about the next big marketing trend, survey respondents identified consumer personalization (29%), AI (26%), and voice search (21.23%). These top three responses, which total 75% of all AI applications, demonstrate that AI is more pervasive and prominent than respondents realize.

– “2018 Future of Marketing and AI Survey,” BrightEdge

Intuition Behind Genetic Algorithms and Optimization

How do we know Artificial Intelligence is possible?

Well, we’re intelligent, and we can reproduce, so we can make more smart things. Paging Scarlett Johansson . . .

That’s not really artificial, although DNA technology is advancing, but it is a “proof of concept.” Smart things exist, and we can make copies of them. And if that’s too hard because brains are not only really squishy but also the parts are really really tiny and if we take them apart they don’t work any more, we can copy the process that made them. That is, Mr. Darwin’s process: evolution via selection.

Not exactly natural selection, because we don’t want to wait millions of years for this to happen. But Darwin was inspired in the first place because he noticed how much we had been able to change dogs and other domestic animals and plants (it’s especially obvious with dogs) by artificial selection.

We picked organisms which did stuff we found useful, or looked some way that we liked, and let them breed some more. This was slow and had some problems (such as hip dysplasia in extra large dogs, for instance), but the idea is extremely powerful.

The idea is, if you want a computer to solve a problem, make sure it can evolve. That is, its program can change. Then all you need is a performance measurement. If you can tell the difference between when it is doing better and when it is doing worse, then you let the ones that do better have more influence on the next generation.

Although it isn’t usually expressed this way, this is already what happens when you train neural nets with a Machine Learning method. The weights of the connections between the neurons change, in a direction that will improve the performance on the training data. Of course, the overall program is still the same. It’s more like changing the values of some constants, but one of the lessons of computer science going all the way back to Turing and von Neumann is that ultimately, there isn’t any real difference between “programs” and “data.” Changing the neural weights can accomplish the same thing that a change in the program would, for sufficiently complicated networks.

The only reason we don’t usually talk of this in evolutionary terms is that we already have a better model for computer evolution: genetic algorithms. Here, the idea is that the actual source code of the program (meaning words in the programming language that follow a grammar and logic, not numbers representing weights or constants) is the thing that will evolve. The same word, code, is used both for the DNA sequences that tell our cells what proteins to produce and for the sequences of computer instructions. But the idea goes back to Turing himself, in 1950, even before Watson and Crick worked out the structure of DNA a couple of years later, and by the 1960s many researchers had advanced the idea.

The way DNA works: it is (to a first approximation – there’s other stuff going on too) a linear sequence of chemical units, each of which is one of 4 standard bases we label A, G, C, and T. There is machinery in the cell that maps each of the 64 possible combinations of 3 letters in a row either to one of 20 amino acids or to punctuation that controls the overall activity. The sequence of amino acids is assembled and automatically folds itself into a protein that does something useful, usually. The more you study this the more amazing it gets!

How do genetic algorithms work? Remarkably primitively, in fact. You create a simple programming language that does things related to the problem you are trying to solve, write code for some little bots that live inside the computer’s memory and do the things, and then you follow the steps below (a minimal sketch in code comes right after the list).

  • Mercilessly delete the ones that did the things less well.
  • Randomly change some of the ones that survived.
  • Randomly mix together code from the ones that survived.
  • Lather, rinse, repeat.
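
Here is that loop as a minimal sketch in Python. Instead of evolving real source code, it evolves strings of bits scored by a made-up fitness function (counting 1s), which is the classic toy version of the idea:

import random

TARGET_LENGTH = 30
POPULATION = 40
GENERATIONS = 100

def fitness(bits):
    # Made-up performance measurement: more 1s means "doing the things better."
    return sum(bits)

def random_bot():
    return [random.randint(0, 1) for _ in range(TARGET_LENGTH)]

def mutate(bits, rate=0.02):
    # Randomly change some of the survivors.
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(mom, dad):
    # Randomly mix together code from the survivors.
    cut = random.randrange(1, TARGET_LENGTH)
    return mom[:cut] + dad[cut:]

population = [random_bot() for _ in range(POPULATION)]
for generation in range(GENERATIONS):
    # Mercilessly delete the half that did the things less well.
    population.sort(key=fitness, reverse=True)
    survivors = population[: POPULATION // 2]
    # Refill the population with mutated offspring of random pairs of survivors.
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POPULATION - len(survivors))]
    population = survivors + children

print(max(fitness(bot) for bot in population))  # close to 30 after enough generations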

This sounds ridiculously inefficient but guess what? Evolution is ridiculously inefficient, yet here we are. One big advantage is that if you can make each generation run a million times faster than actual carbon-based earth life forms need, and if Moore’s law has given you computers whose memory and speed have steadily increased for several decades, there is room and time for lots of crazy stuff to happen.

One interesting successful example of genetic algorithms involved a game called Core War, created in 1984 by D.G. Jones and A.K. Dewdney, in which a tiny programming language was created that allowed programs written in it to modify the area of the computer’s memory in which the programs themselves were running, with the goal of taking over more and more of the segment of memory dedicated to the game. Some of the winning contestants did not write the programs that finally won, but rather created programs that could evolve to improve. Read all about it at https://en.wikipedia.org/wiki/Core_War.

Often we don’t understand why the new bots do better than the old bots, but they do, and sometimes we can see why. The results can be pretty creepy, like the times circuit-design projects used genetic algorithms and found that the winning designs had “cheated” – exploiting physical properties of the electronics that the designers hadn’t even known about to create clocks and capacitors that weren’t included in their toolkit. The following is taken from Nick Bostrom’s book Superintelligence, which is about what can happen if we screw up building AIs (spoiler: you don’t want to know):

Even simple evolutionary search processes sometimes produce highly unexpected results, solutions that satisfy a formal user-defined criterion in a very different way than the user expected or intended.

The field of evolvable hardware offers many illustrations of this phenomenon. In this field, an evolutionary algorithm searches the space of hardware designs, testing the fitness of each design by instantiating it physically on a rapidly reconfigurable array or motherboard. The evolved designs often show remarkable economy. For instance, one search discovered a frequency discrimination circuit that functioned without a clock – a component normally considered necessary for this function. The researchers estimated that the evolved circuit was between one and two orders of magnitude smaller than what a human engineer would have required for the task. The circuit exploited the physical properties of its components in unorthodox ways; some active, necessary components were not even connected to the input or output pins! These components instead participated via what would normally be considered nuisance side effects, such as electromagnetic coupling or power-supply loading.

Another search process, tasked with creating an oscillator, was deprived of a seemingly even more indispensable component, the capacitor. When the algorithm presented its successful solution, the researchers examined it and at first concluded that it “should not work.” Upon more careful examination, they discovered that the algorithm had, MacGyver-like, reconfigured its sensor-less motherboard into a makeshift radio receiver, using the printed circuit board tracks as an aerial to pick up signals generated by personal computers that happened to be situated nearby in the laboratory. The circuit amplified this signal to produce the desired oscillating output.

In other experiments, evolutionary algorithms designed circuits that sensed whether the motherboard was being monitored with an oscilloscope, or whether a soldering iron was connected to the lab’s common power supply. These examples illustrate how an open-ended search process can repurpose the materials accessible to it in order to devise completely unexpected sensory capabilities, by means that conventional human design-thinking is poorly equipped to exploit or even account for in retrospect. (Nick Bostrom, Superintelligence: Paths, Dangers, Strategies (Oxford University Press, 2014), 154)

Obviously, most random changes to a program will break it, or at best do nothing, just like random changes to our DNA are usually not beneficial mutations. Here is a quote from the well-known philosopher Iosif Vissarionovich Dzhugashvili:

“Quantity has a quality all its own.”

61%, regardless of company size, pointed to machine learning and AI as their company’s most significant data initiative for next year.

– “2018 Outlook: Machine Learning and Artificial Intelligence,” blog.memsql.com/2018

Intuition Behind Programming Tools

“So how do I get started with this AI stuff?”

Well, it depends what you know already. There are lots of good tools out there, requiring different levels of background. The big division is whether you already know how to program or not. If you’re not sure whether you know how to program or not, then you don’t.

This gives us three sub-questions:

  1. How do I do AI if I don’t know how to program?
  2. How do I do AI if I do know how to program?
  3. How do I learn how to program so I can do AI better?

Let’s answer the last one first. The way you learn to program is by signing up for a course in it, and/or by doing it. But it doesn’t actually matter much which course, or how you start, because with computer languages, just like with human languages, despite enormous superficial differences, at a deep enough level, they’re mostly the same.

This is even a theorem. The oldest programming language that is actually used today is Turing Machine Code, from Turing’s 1936 paper “On Computable Numbers,” which is the most important mathematical paper ever written.

But you don’t have to learn it, except as a cute instructional tool, because what he proved is that his very simple language could do anything any machine could do no matter how complicated you made the instructions, although it might be rather slow.

Since then, a lot of other languages have been invented, and quite a few of them still matter. Here’s a list of computer languages from oldest to newest (and you can start with ANY of these, though the worst one to start with is C++):

  1. Turing Machine Code (1936) is the foundation, and you can prove things about it, although no machines use it for real work.
  2. FORTRAN (short for FORmula TRANslator) invented by John Backus at IBM in 1956, is very important for historical reasons and is still used for high-powered scientific computing, because so many tools have been developed in it. It’s a bit clunky, but not difficult, and you can actually get a job if you’re good enough at it.
  3. LISP (short for LISt Processor), invented by John McCarthy at MIT in 1959, is the simplest, most mathematical, and most elegant computer language in common use. It has been used in AI from the beginning because of its unparalleled flexibility. Its most important dialect is called SCHEME.

    However, LISP is used mostly for research or by superstar programmers who are so good that their clients allow them to use whatever the heck language they want to. Otherwise, it’s not so popular – not because it’s harder than other computer languages, but because it’s quite different, and programmers tend only to do what they need to do.

    That’s actually one of the secrets of being a good programmer: if you’re increasingly bored by repetitive tasks, you’re motivated to invent laborsaving tricks.

  4. BASIC (Beginner’s All-purpose Symbolic Instruction Code), invented by John Kemeny at Dartmouth in 1964, is probably the easiest language to learn from scratch, and some of us still use it all the time for quick projects that don’t have a lot of complicated parts. Still, some professors claim that it promotes bad programming habits.

    The modern version of BASIC is called “Visual Basic,” and it’s used inside Microsoft Excel if you want to do programming there.

  5. C (so-named because an early version of this programming language was named B, but don’t ask why) was invented by Dennis Ritchie at Bell Labs in 1972. It was developed to work in conjunction with the UNIX operating system (that’s a program that turns a computer from a big box of transistors into something that humans can talk to).

    C was designed to write programs that were powerful and efficient. It’s really easy to screw things up badly if you don’t know what you are doing, but the guys at Bell Labs didn’t care because they did know what they were doing, and C became by far the most widely used language for professional applications.

  6. SQL (Structured Query Language), was developed by Don Chamberlin and Raymond Boyce at IBM in 1973, building on the seminal work of Ted Codd. Unlike most of the other languages on this list, SQL is a specialized language rather than a general-purpose one.

    The specialty in this case is databases, and this is the era of Big Data. Being good at SQL gets more people jobs than being good at any other language. Because it’s a query language where you ask the computer for the answer without telling it how best to calculate the answer, it is both easier to use and easier to misuse, but generations of very smart programmers have improved implementations so that it is now quite smart about how to do stuff.

  7. C++ (don’t ask why), invented by Bjarne Stroustrup at Bell Labs in 1979, was an extension of C with a huge amount of tools and libraries and guardrails and extra rules and idiot proofing, so that ordinary non-Bell-Labs-quality programmers could work on big projects together without the projects dying from bugs.

    If you work as a programmer at a big company with a product that many programmers are needed to build, you probably learned this, but it’s painful.

  8. MATLAB is a proprietary programming environment (meaning it’s only usable with the products of one company, unlike the other languages here, but it’s worth it) developed by Cleve Moler at the University of New Mexico in the late 1970s and early 1980s. The language was so good that he quit to found the company MathWorks in 1984, and to sell it. MATLAB is primarily intended for numerical and symbolic computing, but over the years many tools have been developed in it for AI and other applications.
  9. Python (named after Monty, for real) was invented by Guido van Rossum in 1990 as a general-purpose programming language emphasizing simplicity, readability, flexibility, and fun. Previous languages had evolved for various technical reasons, and so they were not designed to make programmers happy, but by 1990 computers were fast enough and software tools were good enough that this could be done without a big sacrifice in efficiency. Python is very commonly taught in introductory courses, which shows that it succeeded.
  10. R (named for its inventors, Ross Ihaka and Robert Gentleman) was developed at the University of Auckland in 1993 for statistical applications. R is technically a general-purpose language, but it is mainly designed and used for statistical computing and graphics. R is the most important language for developing statistical applications and for data mining, and therefore for a lot of the work talked about in this chapter.
  11. Java, invented by James Gosling at Sun Microsystems in 1994, is a clever way to do what C and C++ did, but in a less painful and more machine-independent fashion, which made it good for internet and device applications.
  12. JavaScript (no relation to Java, although in some ways, it resembles Java) was invented by Brendan Eich at Netscape in 1995. JavaScript is more like SCHEME under the hood, and is the core language of the internet, designed to make web pages easier to write.
  13. C# was invented by Microsoft in 2000 to be a less painful evolutionary successor to C and C++, and they did such a good job that this language is an exception to the rule that good languages carry out one guy’s brilliant and innovative vision.

There have been a bunch of good languages invented in this millennium too, but we hesitate to recommend them until they’ve been around long enough that they’re likely to stick around for a lot longer (this is called the Lindy effect).

“Okay, so we’ve learned about programming. How do we do AI?”

The short answer is to program in MATLAB because it has the most tools, or use R and Python and public libraries of AI tools if you want to not pay MathWorks any money.

“But . . . programming’s too hard! How do we do AI if we don’t want to write programs?”

The short answer is to still use MATLAB because it comes with a lot of applications packages, many of which you don’t need to be a programmer to learn how to use (and some of which we have talked about):

For statistical analysis:

  • Regression techniques, including linear, generalized linear, nonlinear, robust, regularized, ANOVA, repeated measures, and mixed-effects models.
  • Big Data algorithms for dimension reduction, descriptive statistics, k-means clustering, linear regression, logistic regression, and discriminant analysis.
  • Univariate and multivariate probability distributions, random and quasi-random number generators, and Markov chain samplers.
  • Hypothesis tests for distributions, dispersion, and location, and design of experiments (DOE) techniques for optimal, factorial, and response surface designs.

For Machine Learning:

  • Classification Learner app and algorithms for supervised Machine Learning, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, Naive Bayes, discriminant analysis, and Gaussian process regression.
  • Unsupervised machine learning algorithms, including k-means, k-medoids, hierarchical clustering, Gaussian mixtures, and hidden Markov models.
  • Bayesian optimization for tuning Machine Learning algorithms by searching for optimal hyper-parameters.

For Deep Learning and neural networks (quoting from MATLAB’s website):

  • Neural Network Toolbox provides algorithms, pre-trained models, and apps to create, train, visualize, and simulate both shallow and deep neural networks. You can perform classification, regression, clustering, dimensionality reduction, time-series forecasting, and dynamic system modeling and control.
  • Deep Learning networks include convolutional neural networks (ConvNets, CNNs), directed acyclic graph (DAG) network topologies, and autoencoders for image classification, regression, and feature learning. For time-series classification and regression, the toolbox provides long short-term memory (LSTM) deep learning networks. You can visualize intermediate layers and activations, modify network architecture, and monitor training progress.