Algorithms for NLP
CS 11-711, Fall 2020
Lecture 3: Nonlinear text classification
Emma Strubell

Announcements: Project 1 (text classification) will be available after class today, due Friday September 25. Han will lead recitation this …


  1. How to obtain θ?
■ The learning problem is to find the right weights θ.
■ Naïve Bayes: set θ equal to the empirical frequencies:
φ̂_{y,j} = p(x_j | y) = count(y, j) / Σ_{j′=1}^{V} count(y, j′),   μ̂_y = p(y) = count(y) / Σ_{y′} count(y′)
■ Perceptron update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y)
■ Large-margin update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y) + c(y^{(i)}, y)

  2. How to obtain θ?
■ The learning problem is to find the right weights θ.
■ Naïve Bayes: set θ equal to the empirical frequencies:
φ̂_{y,j} = p(x_j | y) = count(y, j) / Σ_{j′=1}^{V} count(y, j′),   μ̂_y = p(y) = count(y) / Σ_{y′} count(y′)
■ Perceptron update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y)
■ Large-margin update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y) + c(y^{(i)}, y)
■ Logistic regression update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − E_{y|x}[f(x^{(i)}, y)]

  3. How to obtain θ?
■ The learning problem is to find the right weights θ.
■ Naïve Bayes: set θ equal to the empirical frequencies:
φ̂_{y,j} = p(x_j | y) = count(y, j) / Σ_{j′=1}^{V} count(y, j′),   μ̂_y = p(y) = count(y) / Σ_{y′} count(y′)
■ Perceptron update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y)
■ Large-margin update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − f(x^{(i)}, ŷ), with ŷ = argmax_{y ∈ Y} θ · f(x^{(i)}, y) + c(y^{(i)}, y)
■ Logistic regression update: θ^{(t+1)} ← θ^{(t)} + f(x^{(i)}, y^{(i)}) − E_{y|x}[f(x^{(i)}, y)]
■ All these methods for supervised learning assume a labeled dataset of N examples: {(x^{(i)}, y^{(i)})}_{i=1}^{N}

  4. Today
Nonlinear classification & evaluating classifiers
Engineered features

  5. Today
Nonlinear classification & evaluating classifiers
Engineered features, linear classification

  6. Today
Nonlinear classification & evaluating classifiers
Engineered features + linear classification → learned features (~mid 2010s)

  7. Today
Nonlinear classification & evaluating classifiers
Engineered features + linear classification → learned features + nonlinear classification (~mid 2010s)

  8. A simple feed-forward architecture

  9. A simple feed-forward architecture
■ Suppose we want to label stories as Y = {Good, Bad, Okay}.

  10. A simple feed-forward architecture
■ Suppose we want to label stories as Y = {Good, Bad, Okay}.
■ What makes a good story?

  11. A simple feed-forward architecture
■ Suppose we want to label stories as Y = {Good, Bad, Okay}.
■ What makes a good story?
■ Exciting plot, compelling characters, interesting setting…

  12. A simple feed-forward architecture
■ Suppose we want to label stories as Y = {Good, Bad, Okay}.
■ What makes a good story?
■ Exciting plot, compelling characters, interesting setting…
■ Let’s call this vector of features z.

  13. A simple feed-forward architecture
■ Suppose we want to label stories as Y = {Good, Bad, Okay}.
■ What makes a good story?
■ Exciting plot, compelling characters, interesting setting…
■ Let’s call this vector of features z.
■ If z is well-chosen, it will be easy to predict from x (the words), and it will make it easy to predict the label, y.

  14. A simple feed-forward architecture
[figure: network diagram with input layer x, hidden layer z, and output y]

  15. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x)
[figure: network diagram with input layer x, hidden layer z, and output y]

  16. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x), where σ is the logistic fn (aka sigmoid): σ(a) = 1 / (1 + e^{−a})
[figure: network diagram with input layer x, hidden layer z, and output y]

  17. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x), where σ is the logistic fn (aka sigmoid): σ(a) = 1 / (1 + e^{−a})
[figure: network diagram with input layer x, hidden layer z, and output y]

  18. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x), where σ is the logistic fn (aka sigmoid): σ(a) = 1 / (1 + e^{−a})
■ The weights can be collected into a matrix, Θ^{(x→z)} = [θ_1^{(x→z)}, θ_2^{(x→z)}, …, θ_{K_z}^{(x→z)}]^⊤,
[figure: network diagram with input layer x, hidden layer z, and output y]

  19. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x), where σ is the logistic fn (aka sigmoid): σ(a) = 1 / (1 + e^{−a})
■ The weights can be collected into a matrix, Θ^{(x→z)} = [θ_1^{(x→z)}, θ_2^{(x→z)}, …, θ_{K_z}^{(x→z)}]^⊤,
■ so that E[z] = σ(Θ^{(x→z)} x), where σ is applied element-wise.
[figure: network diagram with input layer x, hidden layer z, and output y]

  20. A simple feed-forward architecture
■ Let’s predict each z_k from x by binary logistic regression: Pr(z_k = 1 | x) = σ(θ_k^{(x→z)} · x), where σ is the logistic fn (aka sigmoid): σ(a) = 1 / (1 + e^{−a})
■ The weights can be collected into a matrix, Θ^{(x→z)} = [θ_1^{(x→z)}, θ_2^{(x→z)}, …, θ_{K_z}^{(x→z)}]^⊤,
■ so that E[z] = σ(Θ^{(x→z)} x), where σ is applied element-wise (a matrix-vector product; dims: [K_z, V] × [V, 1] = [K_z, 1]).
[figure: network diagram with input layer x, hidden layer z, and output y]

  21. A simple feed-forward architecture
[figure: network diagram with input layer x, hidden layer z, and output y]

  22. A simple feed-forward architecture
■ Next we predict y from z, again using logistic regression (multiclass):
Pr(y = j | z) = exp(θ_j^{(z→y)} · z + b_j) / Σ_{j′ ∈ Y} exp(θ_{j′}^{(z→y)} · z + b_{j′})
The vector of probabilities over each possible y is denoted:
[figure: network diagram with input layer x, hidden layer z, and output y]

  23. A simple feed-forward architecture
■ Next we predict y from z, again using logistic regression (multiclass):
Pr(y = j | z) = exp(θ_j^{(z→y)} · z + b_j) / Σ_{j′ ∈ Y} exp(θ_{j′}^{(z→y)} · z + b_{j′})
(b is an additive bias/offset vector)
The vector of probabilities over each possible y is denoted:
[figure: network diagram with input layer x, hidden layer z, and output y]

  24. A simple feed-forward architecture
■ Next we predict y from z, again using logistic regression (multiclass):
Pr(y = j | z) = exp(θ_j^{(z→y)} · z + b_j) / Σ_{j′ ∈ Y} exp(θ_{j′}^{(z→y)} · z + b_{j′})
(b is an additive bias/offset vector)
The vector of probabilities over each possible y is denoted:
p(y | z) = SoftMax(Θ^{(z→y)} z + b)
[figure: network diagram with input layer x, hidden layer z, and output y]

  25. A simple feed-forward architecture
[figure: network diagram with input layer x, hidden layer z, and output y]

  26. A simple feed-forward architecture
■ In reality, we never observe z; it is a hidden layer. We don’t bother predicting 0/1 values for z, we compute it directly from x.
[figure: network diagram with input layer x, hidden layer z, and output y]

  27. A simple feed-forward architecture
■ In reality, we never observe z; it is a hidden layer. We don’t bother predicting 0/1 values for z, we compute it directly from x.
■ This makes p(y | x) a complex, nonlinear function of x.
[figure: network diagram with input layer x, hidden layer z, and output y]

  28. A simple feed-forward architecture
■ In reality, we never observe z; it is a hidden layer. We don’t bother predicting 0/1 values for z, we compute it directly from x.
■ This makes p(y | x) a complex, nonlinear function of x.
■ We can have multiple hidden layers z^{(1)}, z^{(2)}, …, adding even more expressiveness.
[figure: network diagram with input layer x, hidden layer z, and output y]

  29. A simple feed-forward architecture
■ In reality, we never observe z; it is a hidden layer. We don’t bother predicting 0/1 values for z, we compute it directly from x.
■ This makes p(y | x) a complex, nonlinear function of x.
■ We can have multiple hidden layers z^{(1)}, z^{(2)}, …, adding even more expressiveness.
■ To summarize:
z = σ(Θ^{(x→z)} x)
p(y | z) = SoftMax(Θ^{(z→y)} z + b)
where σ(Θ^{(x→z)} x) = [σ(θ_1^{(x→z)} · x), σ(θ_2^{(x→z)} · x), …, σ(θ_{K_z}^{(x→z)} · x)]^⊤
[figure: network diagram with input layer x, hidden layer z, and output y]
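As a concrete rendering of the summary above, here is a minimal numpy sketch of the forward pass; the toy sizes (V = 10, K_z = 4, K_y = 3) and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())   # subtract max for numerical stability
    return e / e.sum()

def forward(x, Theta_xz, Theta_zy, b):
    """Forward pass of the two-layer network above.
    x: bag-of-words vector, shape [V]
    Theta_xz: hidden weights, shape [K_z, V]
    Theta_zy: output weights, shape [K_y, K_z]
    b: output bias, shape [K_y]
    Returns p(y | x), shape [K_y]."""
    z = sigmoid(Theta_xz @ x)          # hidden layer, shape [K_z]
    return softmax(Theta_zy @ z + b)   # distribution over labels

# Tiny usage example with random weights (hypothetical sizes):
rng = np.random.default_rng(0)
V, K_z, K_y = 10, 4, 3   # vocab size, hidden units, |Y| = 3 (Good/Bad/Okay)
p = forward(rng.random(V), rng.normal(size=(K_z, V)),
            rng.normal(size=(K_y, K_z)), np.zeros(K_y))
print(p, p.sum())   # probabilities over the 3 labels, sums to 1
```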

  30. Activation functions

  31. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.

  32. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.
■ In general, we can write z = f(Θ^{(x→z)} x) to indicate an arbitrary activation function.

  33. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.
■ In general, we can write z = f(Θ^{(x→z)} x) to indicate an arbitrary activation function.
■ Other choices include:

  34. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.
■ In general, we can write z = f(Θ^{(x→z)} x) to indicate an arbitrary activation function.
■ Other choices include:
■ Hyperbolic tangent: tanh, centered at 0, helps avoid saturation

  35. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.
■ In general, we can write z = f(Θ^{(x→z)} x) to indicate an arbitrary activation function.
■ Other choices include:
■ Hyperbolic tangent: tanh, centered at 0, helps avoid saturation
■ Rectified linear unit: ReLU(a) = max(0, a), which is fast to evaluate, easy to analyze, even further avoids saturation.

  36. Activation functions
■ The sigmoid in z = σ(Θ^{(x→z)} x) is called an activation function.
■ In general, we can write z = f(Θ^{(x→z)} x) to indicate an arbitrary activation function.
■ Other choices include:
■ Hyperbolic tangent: tanh, centered at 0, helps avoid saturation
■ Rectified linear unit: ReLU(a) = max(0, a), which is fast to evaluate, easy to analyze, even further avoids saturation.
■ Leaky ReLU: f(a) = a if a ≥ 0, and 0.0001·a otherwise.
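A small sketch of the four activations listed above; the 0.0001 leak slope follows the slide, though in practice it is a tunable constant:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # range (0, 1); saturates for large |a|

def tanh(a):
    return np.tanh(a)                   # range (-1, 1); centered at 0

def relu(a):
    return np.maximum(0.0, a)           # cheap; gradient is 0 only for a < 0

def leaky_relu(a, slope=1e-4):
    return np.where(a >= 0, a, slope * a)   # small slope keeps gradient alive
```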

  37. Training neural networks: Gradient descent

  38. Training neural networks: Gradient descent
■ In general, neural networks are learned by gradient descent, using minibatches:
θ_k^{(z→y)} ← θ_k^{(z→y)} − η^{(t)} ∇_{θ_k^{(z→y)}} ℓ^{(i)}, where

  39. Training neural networks: Gradient descent
■ In general, neural networks are learned by gradient descent, using minibatches:
θ_k^{(z→y)} ← θ_k^{(z→y)} − η^{(t)} ∇_{θ_k^{(z→y)}} ℓ^{(i)}, where
■ η^{(t)} is the learning rate at update t

  40. Training neural networks: Gradient descent
■ In general, neural networks are learned by gradient descent, using minibatches:
θ_k^{(z→y)} ← θ_k^{(z→y)} − η^{(t)} ∇_{θ_k^{(z→y)}} ℓ^{(i)}, where
■ η^{(t)} is the learning rate at update t
■ ℓ^{(i)} is the loss on instance (minibatch) i

  41. Training neural networks: Gradient descent
■ In general, neural networks are learned by gradient descent, using minibatches:
θ_k^{(z→y)} ← θ_k^{(z→y)} − η^{(t)} ∇_{θ_k^{(z→y)}} ℓ^{(i)}, where
■ η^{(t)} is the learning rate at update t
■ ℓ^{(i)} is the loss on instance (minibatch) i
■ ∇_{θ_k^{(z→y)}} ℓ^{(i)} is the gradient of the loss with respect to the output weights θ_k^{(z→y)}:
∇_{θ_k^{(z→y)}} ℓ^{(i)} = [∂ℓ^{(i)}/∂θ^{(z→y)}_{k,1}, ∂ℓ^{(i)}/∂θ^{(z→y)}_{k,2}, …, ∂ℓ^{(i)}/∂θ^{(z→y)}_{k,K_y}]
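A generic minibatch loop matching the update rule above; `grad_fn` stands in for whatever computes the loss and its gradients (backpropagation, next slides), and the dict-of-arrays layout is an illustrative choice:

```python
def sgd(params, grad_fn, minibatches, lr_schedule):
    """Minibatch gradient descent sketch.

    params: dict of parameter arrays, e.g. {"Theta_xz": ..., "Theta_zy": ..., "b": ...}
    grad_fn(params, batch): returns (loss, dict of gradients with the same keys)
    lr_schedule(t): learning rate eta^(t) at update t
    """
    for t, batch in enumerate(minibatches):
        loss, grads = grad_fn(params, batch)
        for name in params:                          # update every parameter
            params[name] -= lr_schedule(t) * grads[name]
    return params

# e.g. a constant schedule: sgd(params, grad_fn, batches, lambda t: 0.1)
```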

  42. Training neural networks: Backpropagation

  43. Training neural networks: Backpropagation
■ If we don’t observe z, how can we learn the weights Θ^{(x→z)}?

  44. Training neural networks: Backpropagation
■ If we don’t observe z, how can we learn the weights Θ^{(x→z)}?
■ Backpropagation: compute a loss on y, and apply the chain rule from calculus to compute a gradient on all parameters.

  45. Training neural networks: Backpropagation
■ If we don’t observe z, how can we learn the weights Θ^{(x→z)}?
■ Backpropagation: compute a loss on y, and apply the chain rule from calculus to compute a gradient on all parameters.
■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, parameters.

  46. Training neural networks: Backpropagation
■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, parameters.
[figure: computation graph linking x^{(i)}, z, ŷ, y^{(i)}, parameters Θ, and the loss ℓ^{(i)}, with forward values (v_x, v_z, v_ŷ) and backward gradients (g_z, g_ŷ)]

  47. Training neural networks: Backpropagation
■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, parameters.
■ Forward pass: values (e.g. v_x) go from parents to children
[figure: computation graph linking x^{(i)}, z, ŷ, y^{(i)}, parameters Θ, and the loss ℓ^{(i)}, with forward values (v_x, v_z, v_ŷ) and backward gradients (g_z, g_ŷ)]

  48. Training neural networks: Backpropagation
■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, parameters.
■ Forward pass: values (e.g. v_x) go from parents to children
■ Backward pass: gradients (e.g. g_z) go from children to parents, implementing the chain rule
[figure: computation graph linking x^{(i)}, z, ŷ, y^{(i)}, parameters Θ, and the loss ℓ^{(i)}, with forward values (v_x, v_z, v_ŷ) and backward gradients (g_z, g_ŷ)]

  49. Training neural networks: Backpropagation
■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, parameters.
■ Forward pass: values (e.g. v_x) go from parents to children
■ Backward pass: gradients (e.g. g_z) go from children to parents, implementing the chain rule
■ As long as the gradient is implemented for a layer/operation, you can add it to the graph, and let automatic differentiation compute updates for every layer.
[figure: computation graph linking x^{(i)}, z, ŷ, y^{(i)}, parameters Θ, and the loss ℓ^{(i)}, with forward values (v_x, v_z, v_ŷ) and backward gradients (g_z, g_ŷ)]
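To make the forward and backward passes concrete, here is a hand-derived sketch for the two-layer network from the earlier slides, with a cross-entropy loss; this is the chain rule written out manually, i.e. what automatic differentiation does for you:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(x, y, Theta_xz, Theta_zy, b):
    """Forward then backward pass for the two-layer network;
    x is the input vector, y the integer index of the true label."""
    # forward pass: values flow from parents to children
    z = sigmoid(Theta_xz @ x)                         # v_z
    scores = Theta_zy @ z + b
    p = np.exp(scores - scores.max()); p /= p.sum()   # v_yhat = p(y | x)
    loss = -np.log(p[y])                              # ell^(i)
    # backward pass: gradients flow from children to parents
    g_scores = p.copy(); g_scores[y] -= 1.0   # d loss / d scores (softmax + CE)
    g_Theta_zy = np.outer(g_scores, z)        # d loss / d Theta_zy
    g_b = g_scores
    g_z = Theta_zy.T @ g_scores               # chain rule through Theta_zy
    g_a = g_z * z * (1.0 - z)                 # through the sigmoid
    g_Theta_xz = np.outer(g_a, x)             # d loss / d Theta_xz
    return loss, {"Theta_xz": g_Theta_xz, "Theta_zy": g_Theta_zy, "b": g_b}
```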

  50. How to represent text for classification? Another choice of R: word embeddings

  51. How to represent text for classification? Another choice of R: word embeddings
■ Text is naturally viewed as a sequence of tokens w_1, w_2, …, w_T

  52. How to represent text for classification? Another choice of R: word embeddings
■ Text is naturally viewed as a sequence of tokens w_1, w_2, …, w_T
■ Context is lost when this sequence is converted to a bag-of-words.

  53. How to represent text for classification? Another choice of R: word embeddings
■ Text is naturally viewed as a sequence of tokens w_1, w_2, …, w_T
■ Context is lost when this sequence is converted to a bag-of-words.
■ Instead, a lookup layer can compute embeddings (real-valued vectors) for each type, resulting in a matrix X^{(0)} ∈ ℝ^{K_e × M}.
[figure: embedding matrix, one column of K_e real values per token in the sequence]
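A minimal sketch of the lookup layer; the vocabulary, embedding values, and sizes are made up for illustration. The point is that each token selects a column, so word order survives in X^{(0)}:

```python
import numpy as np

# Hypothetical embedding lookup: one K_e-dimensional column per token,
# giving X^(0) in R^{K_e x M} for a length-M token sequence.
vocab = {"the": 0, "dog": 1, "barked": 2}
K_e = 4
E = np.random.default_rng(0).normal(size=(K_e, len(vocab)))  # embedding matrix

tokens = ["the", "dog", "barked"]
X0 = E[:, [vocab[w] for w in tokens]]   # shape [K_e, M]; order is preserved
print(X0.shape)                          # (4, 3)
```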

  54. Evaluating classifiers

  55. Evaluating your classifier

  56. Evaluating your classifier
■ Want to predict future performance, on unseen data.

  57. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:

  58. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:
■ training

  59. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:
■ training
■ hyperparameter selection

  60. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:
■ training
■ hyperparameter selection
■ selecting classification model, model structure

  61. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:
■ training
■ hyperparameter selection
■ selecting classification model, model structure
■ preprocessing decisions, such as vocabulary selection

  62. Evaluating your classifier
■ Want to predict future performance, on unseen data.
■ It’s hard to predict the future. Do not evaluate on data that was already used for:
■ training
■ hyperparameter selection
■ selecting classification model, model structure
■ preprocessing decisions, such as vocabulary selection
■ Even if you follow all these rules, you will probably still over-estimate your classifier’s performance, because real future data will differ from your test set in ways that you cannot anticipate.

  63. Accuracy

  64. Accuracy
■ Most basic metric: accuracy. How often is the classifier right?
acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^{(i)} = ŷ^{(i)})
The problem with accuracy is rare labels, also known as class imbalance.

  65. Accuracy
■ Most basic metric: accuracy. How often is the classifier right?
acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^{(i)} = ŷ^{(i)})
The problem with accuracy is rare labels, also known as class imbalance.

  66. Accuracy
■ Most basic metric: accuracy. How often is the classifier right?
acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^{(i)} = ŷ^{(i)})
The problem with accuracy is rare labels, also known as class imbalance.
■ Consider a system for detecting whether a tweet is written in Telugu.

  67. Accuracy
■ Most basic metric: accuracy. How often is the classifier right?
acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^{(i)} = ŷ^{(i)})
The problem with accuracy is rare labels, also known as class imbalance.
■ Consider a system for detecting whether a tweet is written in Telugu.
■ 0.3% of tweets are written in Telugu [Bergsma et al. 2012].

  68. Accuracy
■ Most basic metric: accuracy. How often is the classifier right?
acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^{(i)} = ŷ^{(i)})
The problem with accuracy is rare labels, also known as class imbalance.
■ Consider a system for detecting whether a tweet is written in Telugu.
■ 0.3% of tweets are written in Telugu [Bergsma et al. 2012].
■ A system that says ŷ = NotTelugu 100% of the time is 99.7% accurate.
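A quick numerical check of the class-imbalance point, with a simulated 0.3%-positive label distribution:

```python
import numpy as np

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# With 0.3% Telugu tweets, predicting NotTelugu for every tweet
# already scores about 99.7% accuracy.
rng = np.random.default_rng(0)
y = rng.random(100_000) < 0.003        # True = Telugu, ~0.3% of tweets
y_hat = np.zeros_like(y)               # the "never Telugu" classifier
print(accuracy(y, y_hat))              # ~0.997
```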

  69. Beyond “right” and “wrong”
[figure: overlapping sets of correct labels and predicted labels]

  70. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
[figure: overlapping sets of correct labels and predicted labels]

  71. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
■ False positive: the system incorrectly predicts the label.
[figure: overlapping sets of correct labels and predicted labels]

  72. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
■ False positive: the system incorrectly predicts the label.
■ False negative: the system incorrectly fails to predict the label.
[figure: overlapping sets of correct labels and predicted labels]

  73. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
■ False positive: the system incorrectly predicts the label.
■ False negative: the system incorrectly fails to predict the label.
■ Similarly, there are two ways to be “right”:
[figure: overlapping sets of correct labels and predicted labels]

  74. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
■ False positive: the system incorrectly predicts the label.
■ False negative: the system incorrectly fails to predict the label.
■ Similarly, there are two ways to be “right”:
■ True positive: the system correctly predicts the label.
[figure: overlapping sets of correct labels and predicted labels]

  75. Beyond “right” and “wrong”
■ For any label, there are two ways to be “wrong”:
■ False positive: the system incorrectly predicts the label.
■ False negative: the system incorrectly fails to predict the label.
■ Similarly, there are two ways to be “right”:
■ True positive: the system correctly predicts the label.
■ True negative: the system correctly predicts that the label does not apply to this instance.
[figure: overlapping sets of correct labels and predicted labels]

  76. Precision and recall
[figure: overlapping sets of correct labels and predicted labels]

  77. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
[figure: overlapping sets of correct labels and predicted labels]

  78. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
■ The “never Telugu” classifier has 0 recall.
[figure: overlapping sets of correct labels and predicted labels]

  79. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
■ The “never Telugu” classifier has 0 recall.
■ The “always Telugu” classifier has perfect recall.
[figure: overlapping sets of correct labels and predicted labels]

  80. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
■ The “never Telugu” classifier has 0 recall.
■ The “always Telugu” classifier has perfect recall.
■ Precision: fraction of positive predictions that were correct:
precision = TP / (TP + FP)
[figure: overlapping sets of correct labels and predicted labels]

  81. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
■ The “never Telugu” classifier has 0 recall.
■ The “always Telugu” classifier has perfect recall.
■ Precision: fraction of positive predictions that were correct:
precision = TP / (TP + FP)
■ The “never Telugu” classifier has 0 precision.
[figure: overlapping sets of correct labels and predicted labels]

  82. Precision and recall
■ Recall: fraction of positive instances that were correctly classified:
recall = TP / (TP + FN)
■ The “never Telugu” classifier has 0 recall.
■ The “always Telugu” classifier has perfect recall.
■ Precision: fraction of positive predictions that were correct:
precision = TP / (TP + FP)
■ The “never Telugu” classifier has 0 precision.
■ The “always Telugu” classifier has 0.003 precision.
[figure: overlapping sets of correct labels and predicted labels]
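A small sketch computing both metrics from boolean predictions; the zero-denominator cases are mapped to 0.0, matching the slide's convention for the “never Telugu” classifier:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for a binary label (True = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# "always Telugu" on a 0.3%-positive dataset: perfect recall, 0.003 precision
y = np.arange(1000) < 3                        # 3 positives out of 1000
print(precision_recall(y, np.ones_like(y)))    # (0.003, 1.0)
print(precision_recall(y, np.zeros_like(y)))   # (0.0, 0.0): "never Telugu"
```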

  83. Combining precision and recall

  84. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.

  85. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.
■ For a preliminary medical diagnosis, we might prefer high recall. False positives can be screened out later.

  86. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.
■ For a preliminary medical diagnosis, we might prefer high recall. False positives can be screened out later.
■ The “beyond reasonable doubt” standard of U.S. criminal law implies a preference for high precision.

  87. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.
■ For a preliminary medical diagnosis, we might prefer high recall. False positives can be screened out later.
■ The “beyond reasonable doubt” standard of U.S. criminal law implies a preference for high precision.
■ Most often, we weight them equally using the F_1 measure: the harmonic mean of precision and recall.
F_1 = 2 · precision · recall / (precision + recall)

  88. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.
■ For a preliminary medical diagnosis, we might prefer high recall. False positives can be screened out later.
■ The “beyond reasonable doubt” standard of U.S. criminal law implies a preference for high precision.
■ Most often, we weight them equally using the F_1 measure: the harmonic mean of precision and recall.
F_1 = 2 · precision · recall / (precision + recall)
min(precision, recall) ≤ F_1 ≤ 2 · min(precision, recall)

  89. Combining precision and recall
■ Inherent tradeoff between precision and recall. Choice is problem-specific.
■ For a preliminary medical diagnosis, we might prefer high recall. False positives can be screened out later.
■ The “beyond reasonable doubt” standard of U.S. criminal law implies a preference for high precision.
■ Most often, we weight them equally using the F_1 measure: the harmonic mean of precision and recall.
F_1 = 2 · precision · recall / (precision + recall)
min(precision, recall) ≤ F_1 ≤ 2 · min(precision, recall)
■ Can generalize the F-measure to adjust the tradeoff, such that recall is β-times as important as precision:
F_β = (1 + β²) · precision · recall / (β² · precision + recall)
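The two formulas above in code, with a couple of spot checks:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 gives F_1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.5, 0.5))           # 0.5: F_1 equals P and R when they match
print(f_beta(0.9, 0.1))           # 0.18: the harmonic mean punishes imbalance
print(f_beta(0.9, 0.1, beta=2))   # ~0.12: beta=2 weights recall more heavily
```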

  90. Trading off precision and recall: ROC curve

  91. Trading off precision and recall: ROC curve
[figure: ROC curve]
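The ROC curve plots exactly this tradeoff: each decision threshold over the classifier's scores yields one (false positive rate, true positive rate) point. A minimal sketch with toy scores:

```python
import numpy as np

def roc_points(scores, y_true):
    """True-positive rate vs. false-positive rate as the decision
    threshold sweeps over the classifier's scores."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    points = []
    for t in np.sort(scores)[::-1]:          # high threshold -> low threshold
        y_pred = scores >= t
        tpr = np.sum(y_pred & y_true) / np.sum(y_true)    # = recall
        fpr = np.sum(y_pred & ~y_true) / np.sum(~y_true)
        points.append((fpr, tpr))
    return points

# toy example: positives tend to score higher
print(roc_points([0.9, 0.8, 0.4, 0.2], np.array([True, False, True, False])))
```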
