College Of Engineering Seoul National University

News

A Research Team Led by Professor Sungroh Yoon of SNU College of Engineering Announced Artificial Intelligence Training Datasets for Using Better Hangul

Uploaded by

관리자
Upload Date

2021.10.19
Views

644

A Research Team Led by Professor Sungroh Yoon
of SNU College of Engineering Announced Artificial Intelligence Training Datasets for Using Better Hangul

- Data set for social bias diagnosis of Korean language models.
- Expected to be a meaningful first step towards the development of artificial intelligence that is capable of using correct Hangul

▲ Professor Sungroh Yoon's research team of Seoul National University that led the research
(Image Source: Seoul National University)

Seoul National University College of Engineering (Dean Byoungho Lee) announced on October 8 that Professor Sungroh Yoon's team of the Department of Electrical and Computer Engineering will release K-StereoSet, a dataset for diagnosing social bias in Korean language models, on the 9th (Saturday) in celebration of Hangul Day.

The social bias of artificial intelligence is emerging as an important worldwide keyword for the future of artificial intelligence research.

In Korea, discrimination and hatred toward sexual minorities, those of different race, and the disabled were discovered earlier this year from "Iruda," an artificial intelligence-based Korean chatbot, and became a hot topic. In such a context (see Figure below), the importance of ethics regarding artificial intelligence is growing enough to the point where the Presidential Committee On The Fourth Industrial Revolution and the Ministry of Science and ICT suggested "the principle of human dignity" as one of the tope three principles and "respect for diversity" as one of the top 10 core principles.

▲ Derogatory language used against the LGBTQ community that was captured on the AI-based Korean chatbot ‘Iruda’
(Image source: online community)

Nevertheless, even though research on artificial intelligence-based Korean language models, which is the basis of natural language processing, has been actively conducted until recently, the means for diagnosing their social biases are still lacking.

Amidst being aware of such problems, the K-StereoSet, which will be presented by Professor Sungroh Yoon's team and has been based on the development set of the "StereoSet" released by MIT to diagnose the social bias of the English language model, will continue to be modified and developed in line with the situation of Korea's reality. The original dataset, consisting of about 4,000 samples, was first translated through the Naver Papago API, and then independently inspected and reviewed by a number of researchers, and was built by post-processing to preserve the original sample form and purpose.

▲ Samples of the intrasentence bias diagnostic test and the intersentence bias diagnostic test
(Image source: Professor Sungroh Yoon’s research team at Seoul National University)

The field of social bias in the data consists of a total of four fields: gender, religion, occupation, and race, and the sample form for diagnosing bias is classified into two categories.

The first is an 'intrasentence' form for testing bias within sentences. (See the example on the left of the image above)

When a blank sentence is given, it is diagnosed using which of the three examples provided is given a high score as the content to be filled in the blank. For example, as shown in the example on the left above, it is possible to check whether a person in the profession of 'psychologist' is given a bias of 'self-righteousness' within a sentence.

The second is the ‘intersentence’ form which is a bias diagnostic test that may occur between different sentences. (Refer to the example on the right of the image above) When a certain context is given, three options are given as the next sentence, and diagnosis is made using which sentence is chosen to be given the highest score. For example, given the context that a person is 'Hispanic' as in the example on the right above, in the following sentence, it is possible to check whether there is any bias such whether the individual is an 'illegal citizen'.

“Amongst the bias diagnosis samples, those within the ‘unrelated’ label words that are completely unrelated to the context are placed in the blanks, and thus it is easy to deviate from the intended meaning of the original text during the process of automatic translation. Furthermore, the example sentences of the inter-sentence bias diagnosis sample were converted while paying attention to special situations, such as cases where contextual sentences were not considered,” said researcher JongYun Song who led the research.

"As artificial intelligence-based Korean language models are getting more advanced and commercialized, efforts to secure ethics and eliminate bias are key, and in alignment with Hangul Day, we hope that K-StereoSet will be a small but meaningful first step towards developing artificial intelligence technology that uses better Hangul," said Professor Sungroh Yoon, the research director.