L101: Introduction to Structured Prediction


  1. L101: Introduction to Structured Prediction Ryan Cotterell

  2. What is structured prediction? • It’s just multi-class classification! • Definition (structured): something in the problem is exponentially large • Definition (structured prediction): the output space of the prediction problem is exponentially large

  3. Recall Logistic Regression • Goal: to construct a probability distribution p(y | x) = exp{score(y, x)} / Σ_{y′ ∈ Y} exp{score(y′, x)} (we will define score later) • The major question: what if |Y| is really, really big? • Can we find an efficient algorithm for computing that sum?
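A minimal sketch of this definition, assuming a made-up score function and a tiny two-label output space (neither is from the lecture): compute p(y | x) by exponentiating every candidate's score and dividing by the sum.

```python
import math

def p(y, x, score, output_space):
    """p(y | x) = exp{score(y, x)} / sum over y' in Y of exp{score(y', x)}."""
    Z = sum(math.exp(score(y_prime, x)) for y_prime in output_space)
    return math.exp(score(y, x)) / Z

# Toy example: binary sentiment, so |Y| = 2 and the normalizing sum is cheap.
output_space = ["positive", "negative"]
score = lambda y, x: x.count("great") if y == "positive" else x.count("awful")
print(p("positive", "a great, great movie", score, output_space))  # ~0.88
```

The loop over `output_space` is exactly the sum in question; when the output space is exponentially large, that loop is what becomes infeasible.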

  4. Structured Prediction in a Meme • Sentiment Analysis: |Y| = 2 (is sentiment positive or negative?) • Movie Genre Prediction: |Y| = n (which genre is this script?) • Part-of-Speech Tagging: |Y| = 2^n (this sentence has which part-of-speech-tag sequence?)
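To make the meme's escalation concrete, a quick back-of-the-envelope computation (the choice of n = 20 words is arbitrary):

```python
# Output-space sizes from the meme, for a hypothetical 20-word input.
n = 20
print(2)      # sentiment analysis: |Y| = 2
print(n)      # movie genre prediction: |Y| = n = 20
print(2**n)   # part-of-speech tagging: |Y| = 2**n = 1,048,576
```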

  5. Predict Trees! • Predict dependency parses from raw text • Classic problem in NLP

  6. Predict Subsets! • Determinantal Point Processes • A distribution over subsets
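A hedged sketch of a distribution over subsets, assuming the common L-ensemble parameterization of a DPP (the slide does not specify one): P(S) is proportional to the determinant of the kernel submatrix indexed by S, with normalizer det(L + I).

```python
import numpy as np

def dpp_prob(L, S):
    """L-ensemble DPP: P(S) = det(L_S) / det(L + I), where L_S is the
    submatrix of the PSD kernel L indexed by the subset S."""
    L_S = L[np.ix_(S, S)]
    return np.linalg.det(L_S) / np.linalg.det(L + np.eye(len(L)))

# Toy kernel over 3 items; items 0 and 1 are near-duplicates.
L = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(dpp_prob(L, [0, 1]))  # redundant pair: low probability (~0.03)
print(dpp_prob(L, [0, 2]))  # diverse pair: higher probability (~0.16)
```

Note that n items have 2^n subsets, so this output space is exponentially large in exactly the sense of the definition above.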

  7. Why isn’t Structured Prediction just Statistics? • Computer scientists develop combinatorial algorithms professionally • Minimum spanning tree, shortest-path problems, maximum flow, LP relaxations • Structured prediction is the intersection of algorithms and high-dimensional statistics [Venn diagram: (theoretical) statistics ∩ computer science]

  8. Deep Dive into Discriminative Tagging • Assign each word in a sentence a coarse-grained grammatical category • Noun, Verb, Adjective, Adverb, Determiner, etc… • Arguably, the simplest structured prediction problem in NLP

  9. Back in 2001…

  10. What is a score function for tagging? • An arbitrary function that takes a word sequence and a tag sequence as input and tells you how good they are together: score(w, t) = “goodness”(w, t), and p(t | w) ∝ exp{score(w, t)}

  11. Your score function can be any function! • Linear function (dot product of a weight vector and a feature function): score(w, t) = θ · f(w, t) • Non-linear function (neural network): score(w, t) = NeuralNetwork(w, t)
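A sketch of the linear option, score(w, t) = θ · f(w, t), with a made-up indicator feature function; the lecture does not fix a particular f, so the features and weights below are illustrative only.

```python
from collections import Counter

def f(words, tags):
    """Toy feature function f(w, t): counts of word/tag pairs and tag bigrams."""
    feats = Counter()
    for word, tag in zip(words, tags):
        feats[("emit", word, tag)] += 1
    for prev, cur in zip(tags, tags[1:]):
        feats[("trans", prev, cur)] += 1
    return feats

def score(theta, words, tags):
    """score(w, t) = theta . f(w, t): dot product over the active features."""
    return sum(theta.get(k, 0.0) * v for k, v in f(words, tags).items())

# Hypothetical weights: "flies" as a verb, and noun-before-verb, score well.
theta = {("emit", "flies", "V"): 2.0, ("trans", "N", "V"): 1.0}
print(score(theta, ["Time", "flies"], ["N", "V"]))  # 3.0
```

Swapping `score` for a neural network leaves everything downstream (the exponentiation, the normalizer) unchanged.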

  12. [Figure: a tagging lattice for the sentence “Time flies like an arrow”, with a column per word and the candidate tags N, V, A, D in each column; one path through the lattice is scored, e.g. score(w, N, V, A, D, N)]

  13. How do we normalize? • Why is this hard? There are an exponential number of summands: p(t | w) = exp{score(t, w)} / Σ_{t′ ∈ T^n} exp{score(t′, w)}, where the denominator has O(|T|^n) terms • T is the set of tags (typically about 12) • |T^n| = |T|^n • The naïve algorithm runs in O(|T|^n) • The normalizer is termed the partition function (Zustandssumme): Z(w) = Σ_{t′ ∈ T^n} exp{score(t′, w)}
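A direct transcription of the naïve normalizer, using a stand-in score function: enumerate all |T|^n tag sequences and sum the exponentiated scores. The enumeration itself is the point; it is why the naïve algorithm is O(|T|^n).

```python
import itertools
import math

def Z(words, tag_set, score):
    """Z(w) = sum over all t' in T^n of exp{score(t', w)}: |T|^n summands."""
    return sum(math.exp(score(tags, words))
               for tags in itertools.product(tag_set, repeat=len(words)))

tag_set = ["N", "V", "A", "D"]
words = "Time flies like an arrow".split()
uniform = lambda tags, words: 0.0   # stand-in score: every sequence ties
print(Z(words, tag_set, uniform))   # 4**5 = 1024 summands, each exp(0) = 1

# With |T| = 12 tags and a 20-word sentence the sum has 12**20 ≈ 3.8e21
# terms, which is why the lecture asks whether the sum can be computed
# efficiently.
```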
