
A simple and robust algorithm for extracting terminology

Luís Sarmento, Linguateca, www.linguateca.pt / las@letras.up.pt


  1. A simple and robust algorithm for extracting terminology
     Luís Sarmento, Linguateca
     www.linguateca.pt / las@letras.up.pt
     Luís Sarmento @ META Simposium - For a Proactive Translatology (Université de Montréal, Québec, Canadá, 7-9 April 2005)

  2. [...]
     - Exponential growth of multi-lingual written information, especially in [...]
     - Need for [...]:
       - Information Retrieval
       - Technical Writing
       - Translation
     - But [...] is constantly evolving and so is its [...].

  3. [...]
     - Terminology resources
       - Short life-cycles, constant need for update
       - Expensive to produce and maintain
       - Need to keep up with emergent domains
     - What we need:
       - [...]
       - Easy-to-use terminology extraction software
       - Computing-aware terminology specialists
       - “Build & Go” terminology resources

  4. [...]
     1. Obtain a specific domain corpus
        - “Do-it-yourself” / web search / specialist
     2. Extract terminology (semi-automatically)
     3. Validate results using corpora
        - Consult specialist, if possible...
     4. Use terminology for IR, Translation, etc...
     5. IF/WHEN more terminology resources are necessary, go back to Step 1

  5. [...]
     - Statistical
       - Rationale: find word sequences that differ from “common-language”
       - Simple and portable but requires a “common-language” corpus for comparison: [...]!
     - Syntactic
       - Rationale: find word sequences that have a specific POS pattern
       - Good precision and coverage, but complex and requires [...]. Difficult to port to other languages.
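
The statistical rationale above (word sequences that differ from “common-language”) could be sketched roughly as follows. This is only an illustration, not the method used in the talk: the scoring formula (a smoothed ratio of relative frequencies), the function names `ngrams` and `relative_freq_ratio`, and the `smoothing` parameter are all assumptions introduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freq_ratio(domain_tokens, general_tokens, n=2, smoothing=1.0):
    """Score each domain n-gram by how much more frequent it is in the
    domain corpus than in a common-language corpus.

    Hypothetical scoring: relative frequency in the domain corpus divided
    by a smoothed relative frequency in the common-language corpus.
    """
    domain = Counter(ngrams(domain_tokens, n))
    general = Counter(ngrams(general_tokens, n))
    d_total = sum(domain.values()) or 1
    g_total = sum(general.values()) or 1
    scores = {}
    for gram, count in domain.items():
        d_rate = count / d_total
        # Smoothing keeps the ratio finite for n-grams that never occur
        # in the common-language corpus.
        g_rate = (general[gram] + smoothing) / (g_total + smoothing)
        scores[gram] = d_rate / g_rate
    return scores


domain_text = "the signal in the optical fiber and the optical fiber cable"
general_text = "the cat in the house and the dog in the garden"
scores = relative_freq_ratio(domain_text.split(), general_text.split())
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(" ".join(gram), round(score, 2))
```

Sequences that are frequent in the domain text but absent from the common-language text (such as "optical fiber") score higher than sequences common to both (such as "in the"). The sketch also shows the dependency noted on the slide: it cannot run without a common-language corpus.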

  6. [...]
     - Morphological:
       - Rationale: find words that look like terms based on roots or suffixes.
       - Good precision for [...] domains but requires [...].
     - Hybrid:
       - Rationale: try to combine any of the previous approaches and use other heuristics
       - May lead to good results but usually lacks [...]

  7. [...]
     - The situation:
       - Large amounts of text available on-line
       - High [...] – should be explored!
       - Multi-lingual corpora (comparable, not parallel)
     - What is required:
       - [...] algorithms
         - Large amounts of text to be processed
       - High [...] algorithms
         - High coverage comes from redundancy
       - “[...]” algorithms
         - Easy to port to other languages: spare the programmers!

  8. [...]
     - We still need human intervention
       - at least domain specialists for validation
       - “Fully automated” methods are never fully automated
     - Human intervention in resource building is advisable and feasible
       - But it cannot be too difficult/boring
     - [...] is more important than coverage!

  9. [...]
     - The Corpógrafo is a complete web-based terminology extraction environment.
     - We assume user intervention:
       - the “need for speed”
       - good precision
       - easy to understand!
     - Need to perform reasonably well in many languages.
     - We cannot afford POS tagging:
       - too complex, too slow, too expensive, too dependent

  10. [...]
      - Collect N-grams from the corpus
      - Ask user to check if they are terms.
      - Advantages:
        - No linguistic resources needed
        - Fast and portable
      - Disadvantages:
        - Too noisy
        - Users obviously find it inappropriate
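
A minimal sketch of the baseline described on this slide (collect N-grams, then hand them to the user for checking) might look like the following. The tokenizer, the function names `tokenize`, `collect_ngrams` and `candidates_for_review`, and the `min_count` threshold are assumptions made for illustration; the slide does not specify any of them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated words kept whole)."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def collect_ngrams(text, min_n=1, max_n=3):
    """Count every contiguous word sequence of length min_n..max_n.

    Simplification: sentence boundaries are ignored, so some n-grams
    span two sentences.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def candidates_for_review(text, min_count=2, **kwargs):
    """Return (n-gram, count) pairs, most frequent first, for a user to validate.

    The min_count cut-off is a hypothetical filter, not part of the slide.
    """
    counts = collect_ngrams(text, **kwargs)
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


sample = "Optical fiber is thin. Optical fiber carries light."
for gram, count in candidates_for_review(sample):
    print(count, gram)
```

No linguistic resources are involved, which matches the advantages listed. The noise is also visible: without any filter, sequences such as "thin optical" or "fiber is" end up in the list the user has to check.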
