KoBERT, an AI Language Model That Understands Korean

2019.10.10

Bidirectional Encoder Representations from Transformers (BERT), developed by Google, opened a new era in language understanding by applying transfer learning to push past the performance of existing AI technologies.

SK T-Brain developed KoBERT to overcome the performance limits observed when applying BERT to Korean. KoBERT was trained on a large corpus of millions of Korean sentences collected from Wikipedia and news articles. To handle the irregular characteristics of the Korean language, we applied data-driven tokenization, which yielded a 2.6% performance improvement while using only 27% of the tokens compared with previous methods.
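To illustrate what "data-driven tokenization" can mean, here is a minimal sketch. The article does not name the algorithm used for KoBERT, so byte-pair encoding (BPE) is assumed here purely for illustration; the function names `learn_bpe`, `apply_merge`, and `tokenize` and the toy word list are hypothetical.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative BPE)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): the pair ("학", "교") is the most frequent, so it is merged first.
corpus = ["학교", "학교에", "학교는", "학생"]
merges = learn_bpe(corpus, num_merges=1)
tokens = tokenize("학교에서", merges)
```

Because the merge rules come from corpus statistics rather than a hand-written rule set, frequent character sequences collapse into single tokens, so the same text is covered with fewer tokens.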

KoBERT uses ring-reduce distributed training to process more than 1 billion sentences quickly across many machines. In addition, by supporting a range of deep learning frameworks, including PyTorch, TensorFlow, ONNX, and MXNet, KoBERT will contribute to the development of AI language understanding services in diverse fields.
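The ring-based training mentioned above is assumed here to refer to the ring all-reduce pattern commonly used to sum gradients across workers; the article gives no implementation details. The following is a single-process simulation, not KoBERT's training code, and the function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring all-reduce (sum) over equal-length gradient vectors.

    `worker_grads[r]` is the local gradient of worker r. Returns the per-worker
    buffers after the exchange; every buffer holds the element-wise sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert all(len(g) == length for g in worker_grads)

    data = [list(g) for g in worker_grads]
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): at step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot all payloads first so the sends behave as simultaneous.
        sends = [((r - s) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in chunk(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Now worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (all-gather): at step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in chunk(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data


# Three simulated workers, each with a 4-element gradient.
grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
```

In this scheme each worker only talks to its neighbour and sends 2 × (n − 1) chunks in total, so the per-worker traffic stays roughly constant as machines are added, which is why the pattern suits large clusters.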

KoBERT is used in several services within SK Telecom. First, it was applied to the chatbot system that assists agents in SK Telecom's call center. It also powers an AI search service for legal and patent documents, and we recently obtained a patent for this technology, titled "A method of generating a context sensitive document-level vector and a similar document recommendation using the method." In addition, it serves as the core model of a machine reading comprehension technology that extracts accurate answers from the vast body of marketing materials inside SK Telecom.

KoBERT is currently available on GitHub (https://github.com/SKTBrain/KoBERT), and we will keep updating it so that researchers and developers in Korea can put it to further use.