Papers
(*: Equal contribution)
Preprint
-
Learning to Teach with Student Feedback
Yitao Liu*, Tianxiang Sun*, Xipeng Qiu, Xuanjing Huang
arXiv 2109.04641
pdf
/
abstract
Knowledge distillation (KD) has gained much attention due to its
effectiveness in compressing large-scale pre-trained models. In typical
KD methods, the small student model is trained to match the soft targets
generated by the big teacher model. However, the interaction between
student and teacher is one-way. The teacher is usually fixed once
trained, resulting in static soft targets to be distilled. This one-way
interaction leads to the teacher's inability to perceive the
characteristics of the student and its training progress. To address
this issue, we propose Interactive Knowledge Distillation (IKD), which
also allows the teacher to learn to teach from the feedback of the
student. In particular, IKD trains the teacher model to generate
specific soft targets at each training step for a certain student. Joint
optimization of both teacher and student is achieved by two iterative
steps: a course step that optimizes the student with the soft targets of
the teacher, and an exam step that optimizes the teacher with the
feedback of the student. IKD is a general framework that is orthogonal to most existing
knowledge distillation methods. Experimental results show that IKD
outperforms traditional KD methods on various NLP tasks.
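To make the two-step loop concrete, here is a minimal PyTorch-style sketch. It assumes generic `teacher`/`student` classifiers and approximates the exam step with a first-order, MAML-style meta-gradient; it illustrates the course/exam alternation only and is not the authors' implementation.
```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def ikd_round(teacher, student, course_batch, exam_batch,
              opt_teacher, opt_student, inner_lr=1e-3, T=2.0):
    """One course step followed by one exam step (illustrative sketch)."""
    xc, _ = course_batch
    xe, ye = exam_batch

    # Course step: the student matches the teacher's current soft targets.
    with torch.no_grad():
        soft_targets = F.softmax(teacher(xc) / T, dim=-1)
    kd_loss = F.kl_div(F.log_softmax(student(xc) / T, dim=-1),
                       soft_targets, reduction="batchmean") * T * T
    opt_student.zero_grad(); kd_loss.backward(); opt_student.step()

    # Exam step (simplified): simulate one differentiable student update driven
    # by the teacher's soft targets, then backprop the virtual student's exam
    # loss into the teacher so it learns to generate more helpful targets.
    params = {n: p.detach().clone().requires_grad_(True)
              for n, p in student.named_parameters()}
    soft = F.softmax(teacher(xc) / T, dim=-1)          # keep the graph to the teacher
    inner = F.kl_div(F.log_softmax(functional_call(student, params, (xc,)) / T, dim=-1),
                     soft, reduction="batchmean") * T * T
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    virtual = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
    exam_loss = F.cross_entropy(functional_call(student, virtual, (xe,)), ye)
    opt_teacher.zero_grad(); exam_loss.backward(); opt_teacher.step()
```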
-
Early Exiting with Ensemble Internal Classifiers
Tianxiang Sun*, Yunhua Zhou*, Xiangyang Liu, Xinyu Zhang,
Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu
arXiv 2105.13792
pdf
/
abstract
As a simple technique to accelerate inference of large-scale pre-trained
models, early exiting has gained much attention in the NLP community. It
allows samples to exit early at internal classifiers without passing
through the entire model. Most existing work usually trains the internal
classifiers independently and employs an exiting strategy to decide
whether or not to exit based on the confidence of the current internal
classifier. However, none of these works takes full advantage of the
fact that the internal classifiers are trained to solve the same task
and can therefore be used to construct an ensemble. In this paper, we show
that a novel objective function for the training of the ensemble
internal classifiers can be naturally induced from the perspective of
ensemble learning and information theory. The proposed training
objective consists of two terms: one for accuracy and the other for the
diversity of the internal classifiers. In contrast, the objective used
in prior work is exactly the accuracy term of our training objective and
therefore optimizes accuracy but not diversity. Further, we
propose a simple voting-based strategy that considers predictions of all
the past internal classifiers to infer the correct label and decide
whether to exit. Experimental results on various NLP tasks show that our
proposed objective function and voting-based strategy can achieve better
accuracy-speed trade-offs.
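A minimal sketch of the voting-based exit strategy, assuming single-instance inference and an illustrative vote threshold (not the paper's exact rule):
```python
import torch

@torch.no_grad()
def vote_based_exit(hidden_states, internal_classifiers, vote_threshold):
    """Run internal classifiers layer by layer; exit as soon as one label has
    collected `vote_threshold` votes from the classifiers seen so far.
    Assumes batch size 1 (illustrative sketch)."""
    votes = None
    for layer_idx, (h, clf) in enumerate(zip(hidden_states, internal_classifiers)):
        probs = torch.softmax(clf(h), dim=-1)                  # (1, num_labels)
        if votes is None:
            votes = torch.zeros_like(probs)
        votes += torch.nn.functional.one_hot(probs.argmax(-1),
                                             num_classes=probs.size(-1)).float()
        if votes.max() >= vote_threshold:                      # enough agreement: exit early
            return votes.argmax(-1), layer_idx
    return votes.argmax(-1), len(internal_classifiers) - 1
```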
2023
-
Improving Contrastive Learning of Sentence Embeddings from
AI Feedback
Qinyuan Cheng, Xiaogui Yang, Tianxiang Sun, Linyang Li,
Xipeng Qiu
ACL (Findings) 2023
pdf
/
abstract
/
code
Contrastive learning has become a popular approach in natural language
processing, particularly for the learning of sentence embeddings.
However, the discrete nature of natural language makes it difficult to
ensure the quality of positive and negative sample pairs generated
through data augmentation methods. Although supervised contrastive
learning can produce more accurate sample pairs with human feedback
labels, it still lacks fine-grained training signals. In this paper, we
propose to improve Contrastive Learning of sentence embeddings from AI
Feedback (CLAIF). Our method utilizes AI feedback from large pre-trained
language models (LLMs) to construct sample pairs with fine-grained
sample similarity scores to improve contrastive learning. In addition, we
combine human feedback and AI feedback to provide better supervision
signals for supervised contrastive learning of sentence embeddings.
Experimental results show that our method achieves state-of-the-art
performance on several semantic textual similarity (STS) and transfer
learning tasks compared to other unsupervised and supervised contrastive
learning methods.
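As an illustration of how fine-grained similarity scores can supervise sentence embeddings, the sketch below regresses embedding cosine similarity onto LLM-provided scores; the `encoder` module and batch format are assumptions, and the paper's exact objective may differ.
```python
import torch
import torch.nn.functional as F

def claif_style_loss(encoder, batch):
    """Fit embedding cosine similarity to fine-grained AI-feedback scores (sketch)."""
    emb_a = encoder(batch["sent1"])                      # (B, d) sentence embeddings
    emb_b = encoder(batch["sent2"])                      # (B, d)
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)      # values in [-1, 1]
    target = batch["score"]                              # LLM-assigned similarity in [0, 1]
    return F.mse_loss((cos + 1) / 2, target)             # map cosine to [0, 1] before regression
```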
-
CodeIE: Large Code Generation Models are Better Few-Shot
Information Extractors
Peng Li*, Tianxiang Sun*, Qiong Tang, Hang Yan, Yuanbin Wu,
Xuanjing Huang, Xipeng Qiu
ACL 2023
pdf
/
abstract
/
code
Large language models (LLMs) pre-trained on massive corpora have
demonstrated impressive few-shot learning ability on many NLP tasks. A
common practice is to recast the task into a text-to-text format such
that generative LLMs of natural language (NL-LLMs) like GPT-3 can be
prompted to solve it. However, it is non-trivial to perform information
extraction (IE) tasks with NL-LLMs since the output of IE tasks is
usually structured and therefore hard to convert into plain
text. In this paper, we propose to recast the structured output in the
form of code instead of natural language and utilize generative LLMs of
code (Code-LLMs) such as Codex to perform IE tasks, in particular, named
entity recognition and relation extraction. In contrast to NL-LLMs, we
show that Code-LLMs can be well-aligned with these IE tasks by designing
code-style prompts and formulating these IE tasks as code generation
tasks. Experimental results on seven benchmarks show that our method
consistently outperforms fine-tuning moderate-size pre-trained models
specially designed for IE tasks (e.g., UIE) and prompting NL-LLMs under
few-shot settings. We further conduct a series of in-depth analyses to
demonstrate the merits of leveraging Code-LLMs for IE tasks.
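A hypothetical code-style NER prompt illustrating the recasting idea (the paper's actual template may differ):
```python
# A few-shot, code-style NER prompt (illustrative; not the paper's exact template).
prompt = '''
def named_entity_recognition(input_text):
    """Extract named entities from input_text as (span, type) pairs."""
    # input_text: "Steve Jobs founded Apple in Cupertino."
    entity_list = [
        {"text": "Steve Jobs", "type": "person"},
        {"text": "Apple", "type": "organization"},
        {"text": "Cupertino", "type": "location"},
    ]
    return entity_list

def named_entity_recognition(input_text):
    """Extract named entities from input_text as (span, type) pairs."""
    # input_text: "Barack Obama was born in Hawaii."
    entity_list = ['''
# The Code-LLM completes `entity_list`, and the structured output is recovered
# by parsing the generated Python literal (e.g., with ast.literal_eval).
```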
-
DiffusionBERT: Improving Generative Masked Language Models
with Diffusion Models
Zhengfu He*, Tianxiang Sun*, Kuanning Wang, Xuanjing Huang,
Xipeng Qiu
ACL 2023
pdf
/
abstract
/
code
We present DiffusionBERT, a new generative masked language model based
on discrete diffusion models. Diffusion models and many pre-trained
language models have a shared training objective, i.e., denoising,
making it possible to combine the two powerful models and enjoy the best
of both worlds. On the one hand, diffusion models offer a promising
training strategy that helps improve the generation quality. On the
other hand, pre-trained denoising language models (e.g., BERT) can be
used as a good initialization that accelerates convergence. We explore
training BERT to learn the reverse process of a discrete diffusion
process with an absorbing state and elucidate several designs to improve
it. First, we propose a new noise schedule for the forward diffusion
process that controls the degree of noise added at each step based on
the information of each token. Second, we investigate several designs of
incorporating the time step into BERT. Experiments on unconditional text
generation demonstrate that DiffusionBERT achieves significant
improvement over existing diffusion models for text (e.g., D3PM and
Diffusion-LM) and previous generative masked language models in terms of
perplexity and BLEU score.
-
Multitask Pre-training of Modular Prompt for Few-Shot
Learning
Tianxiang Sun*, Zhengfu He*, Qin Zhu, Xipeng Qiu, Xuanjing
Huang
ACL 2023
pdf
/
abstract
/
code
Prompt tuning is a parameter-efficient approach to adapting pre-trained
language models to downstream tasks. Although prompt tuning has been
shown to match the performance of full model tuning when training data
is sufficient, it tends to struggle in few-shot learning settings. In
this paper, we present Multi-task Pre-trained Modular Prompt (MP2) to
boost prompt tuning for few-shot learning. MP2 is a set of combinable
prompts pre-trained on 38 Chinese tasks. On downstream tasks, the
pre-trained prompts are selectively activated and combined, leading to
strong compositional generalization to unseen tasks. To bridge the gap
between pre-training and fine-tuning, we formulate upstream and
downstream tasks into a unified machine reading comprehension task.
Extensive experiments under two learning paradigms, i.e., gradient
descent and black-box tuning, show that MP2 significantly outperforms
prompt tuning, full model tuning, and prior prompt pre-training methods
in few-shot settings. In addition, we demonstrate that MP2 can achieve
surprisingly fast and strong adaptation to downstream tasks by merely
learning 8 parameters to combine the pre-trained modular prompts.
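The few-parameter adaptation can be pictured as learning a small routing vector over a frozen bank of pre-trained modular prompts; the sketch below uses assumed shapes and is not the released code.
```python
import torch
import torch.nn as nn

class ModularPromptRouter(nn.Module):
    """Combine a frozen bank of pre-trained prompts with a few learned weights (sketch)."""
    def __init__(self, prompt_bank):                     # prompt_bank: (K, prompt_len, d)
        super().__init__()
        self.prompt_bank = nn.Parameter(prompt_bank, requires_grad=False)     # frozen
        self.router_logits = nn.Parameter(torch.zeros(prompt_bank.size(0)))   # K params, e.g. K=8

    def forward(self):
        weights = torch.softmax(self.router_logits, dim=0)       # (K,)
        # Weighted combination of the K modular prompts -> one task-specific prompt.
        return torch.einsum("k,kld->ld", weights, self.prompt_bank)
```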
2022
-
BBTv2: Towards a Gradient-Free Future with Large Language
Models
Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuanjing
Huang, Xipeng Qiu
EMNLP 2022
pdf
/
abstract
/
code
Most downstream adaptation methods tune all or part of the parameters of
pre-trained models (PTMs) through gradient descent, where the tuning
cost increases linearly with the growth of the model size.
By contrast, gradient-free methods only require the forward computation
of the PTM to tune the prompt, retaining the benefits of efficient
tuning and deployment.
However, past work on gradient-free tuning often introduces gradient
descent to seek a good initialization of the prompt and lacks
versatility across tasks and PTMs.
In this paper, we present BBTv2, an improved version of Black-Box
Tuning, to drive PTMs for few-shot learning.
We prepend continuous prompts to every layer of the PTM and propose a
divide-and-conquer gradient-free algorithm to optimize the prompts at
different layers alternately.
Extensive experiments across various tasks and PTMs show that BBTv2 can
achieve comparable performance to full model tuning and state-of-the-art
parameter-efficient methods (e.g., Adapter, LoRA, BitFit, etc.) under
few-shot settings while requiring far fewer tunable parameters.
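A rough sketch of the divide-and-conquer loop, using plain random search in place of the CMA-ES used in practice and an assumed `evaluate_loss` that stands for forward-only calls to the frozen PTM:
```python
import numpy as np

def bbtv2_style_search(num_layers, low_dim, prompt_dim, evaluate_loss,
                       outer_rounds=5, trials_per_layer=50, sigma=1.0, seed=0):
    """Alternately optimize each layer's subspace vector with random search (sketch).
    evaluate_loss(prompts) -> float, where prompts is a (num_layers, prompt_dim)
    array and the loss comes from forward passes of the frozen PTM only."""
    rng = np.random.default_rng(seed)
    # One fixed random projection per layer maps the low-dim vector to the prompt space.
    projections = [rng.normal(0, 1 / np.sqrt(low_dim), size=(prompt_dim, low_dim))
                   for _ in range(num_layers)]
    z = [np.zeros(low_dim) for _ in range(num_layers)]

    def prompts_from(zs):
        return np.stack([A @ v for A, v in zip(projections, zs)])

    best = evaluate_loss(prompts_from(z))
    for _ in range(outer_rounds):
        for layer in range(num_layers):                  # divide: one layer at a time
            for _ in range(trials_per_layer):            # conquer: derivative-free search
                cand = [v.copy() for v in z]
                cand[layer] = z[layer] + sigma * rng.normal(size=low_dim)
                loss = evaluate_loss(prompts_from(cand))
                if loss < best:
                    best, z = loss, cand
    return prompts_from(z), best
```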
-
BERTScore is Unfair: On Social Bias in Language Model-Based
Metrics for Text Generation
Tianxiang Sun*, Junliang He*, Xipeng Qiu, Xuanjing Huang
EMNLP 2022
pdf
/
abstract
/
code
Automatic evaluation metrics are crucial to the development of
generative systems.
In recent years, pre-trained language model (PLM) based metrics, such as
BERTScore, have been commonly adopted in
various generation tasks.
However, it has been demonstrated that PLMs encode a range of
stereotypical societal biases, raising concerns about the fairness of
PLMs as metrics.
To that end, this work presents the first systematic study on the social
bias in PLM-based metrics.
We demonstrate that popular PLM-based metrics exhibit significantly
higher social bias than traditional metrics on 6 sensitive attributes,
namely race, gender, religion, physical appearance, age, and
socioeconomic status.
In-depth analysis suggests that the choice of metric paradigm (matching,
regression, or generation) has a greater impact on fairness than the
choice of PLM.
In addition, we develop debiasing adapters that are injected into PLM
layers, mitigating bias in PLM-based metrics while retaining high
performance for evaluating text generation.
-
Late Prompt Tuning: A Late Prompt Could Be Better Than Many
Prompts
Xiangyang Liu, Tianxiang Sun, Xuanjing Huang, Xipeng Qiu
EMNLP (Findings) 2022
pdf
/
abstract
/
code
Prompt tuning is a parameter-efficient tuning (PETuning) method for
utilizing pre-trained models (PTMs) that simply prepends a soft prompt
to the input and only optimizes the prompt to adapt PTMs to downstream
tasks. Although it is parameter- and deployment-efficient, its
performance still lags behind other state-of-the-art PETuning methods.
Besides, the training cost of prompt tuning is not significantly reduced
due to the back-propagation through the entire model. Through empirical
analyses, we shed some light on the lagging performance of prompt tuning
and recognize a trade-off between the propagation distance from label
signals to the inserted prompt and the influence of the prompt on model
outputs. Further, we present Late Prompt Tuning (LPT) that inserts a
late prompt into an intermediate layer of the PTM instead of the input
layer or all layers. The late prompt is obtained by a neural prompt
generator conditioned on the hidden states before the prompt insertion
layer and is therefore instance-dependent. Extensive
experiments across various tasks and PTMs show that LPT can
achieve performance competitive with full model tuning and other PETuning
methods under both full-data and few-shot scenarios while possessing
faster training speed and lower memory cost.
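The instance-dependent late prompt can be sketched as a small generator that pools the hidden states entering the insertion layer; the module names and the pooling choice below are assumptions.
```python
import torch
import torch.nn as nn

class LatePromptGenerator(nn.Module):
    """Generate an instance-dependent prompt from the hidden states before the
    insertion layer, then prepend it at that layer (illustrative sketch)."""
    def __init__(self, hidden_dim, prompt_len, bottleneck=64):
        super().__init__()
        self.prompt_len = prompt_len
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, prompt_len * hidden_dim),
        )

    def forward(self, hidden_states):                    # (B, seq_len, hidden_dim)
        pooled = hidden_states.mean(dim=1)               # simple mean pooling (assumed)
        prompt = self.net(pooled).view(-1, self.prompt_len, hidden_states.size(-1))
        return torch.cat([prompt, hidden_states], dim=1) # prepend the late prompt
```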
-
Black-Box Tuning for Language-Model-as-a-Service
Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang,
Xipeng Qiu
ICML 2022   (Spotlight)
pdf
/
abstract
/
code
/
slides
Extremely large pre-trained language models (PTMs) such as GPT-3 are
usually released as a service, allowing users to design task-specific
prompts to query the PTMs through black-box APIs. In such a
scenario, which we call Language-Model-as-a-Service (LMaaS), the
gradients of PTMs are usually unavailable. Can we optimize the task
prompts by only accessing the model inference APIs? This paper proposes
the black-box tuning framework to optimize the continuous prompt
prepended to the input text via derivative-free optimization. Instead of
optimizing in the original high-dimensional prompt space, which is
intractable for traditional derivative-free optimization, we perform
optimization in a randomly generated subspace due to the low intrinsic
dimensionality of large PTMs. The experimental results show that the
black-box tuning with RoBERTa on a few labeled samples not only
significantly outperforms manual prompts and GPT-3’s in-context learning,
but also surpasses the gradient-based counterparts, i.e., prompt tuning
and full model tuning.
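A minimal sketch of the framework, assuming the `cma` package for CMA-ES and a black-box `api_loss` that returns a scalar from forward-only API calls; the prompt is parameterized through a fixed random projection of a low-dimensional vector.
```python
import numpy as np
import cma  # pip install cma

def black_box_tune(api_loss, prompt_dim, low_dim=500,
                   sigma0=1.0, budget=8000, seed=0):
    """Optimize a continuous prompt through a forward-only API (sketch).
    api_loss(prompt_vector) -> float, e.g. a dev-set loss returned by the service."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0, 1 / np.sqrt(low_dim), size=(prompt_dim, low_dim))  # fixed projection

    es = cma.CMAEvolutionStrategy(np.zeros(low_dim), sigma0, {"seed": seed})
    calls = 0
    while not es.stop() and calls < budget:
        candidates = es.ask()                            # low-dimensional z vectors
        losses = [api_loss(A @ np.asarray(z)) for z in candidates]
        es.tell(candidates, losses)
        calls += len(candidates)
    return A @ np.asarray(es.result.xbest)               # best prompt found
```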
-
Paradigm Shift in Natural Language Processing
Tianxiang Sun, Xiangyang Liu, Xipeng Qiu, Xuanjing Huang
Machine Intelligence Research 2022  
(Invited Paper)
pdf
/
abstract
/
project
/
slides
In the era of deep learning, modeling for most natural language
processing (NLP) tasks has converged into several mainstream paradigms.
For example, we usually adopt the sequence labeling paradigm to solve a
bundle of tasks such as POS-tagging, named entity recognition (NER), and
chunking, and adopt the classification paradigm to solve tasks like
sentiment analysis. With the rapid progress of pre-trained language
models, recent years have witnessed a rising trend of paradigm shift,
which is solving one NLP task in a new paradigm by reformulating the
task. The paradigm shift has achieved great success on many tasks and is
becoming a promising way to improve model performance. Moreover, some of
these paradigms have shown great potential to unify a large number of
NLP tasks, making it possible to build a single model to handle diverse
tasks. In this paper, we review this phenomenon of paradigm shift in
recent years, highlighting several paradigms that have the potential to
solve different NLP tasks.
-
Towards Efficient NLP: A Standard Evaluation and A Strong
Baseline
Xiangyang Liu*, Tianxiang Sun*, Junliang He, Jiawen Wu,
Lingling Wu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu
NAACL 2022 (Oral Presentation)
pdf
/
abstract
/
code
/
benchmark
/
slides
Supersized pre-trained language models have pushed the accuracy of
various natural language processing (NLP) tasks to a new
state-of-the-art (SOTA). Rather than pursuing ever-higher SOTA accuracy
alone, more and more researchers are paying attention to model
efficiency and usability. Unlike accuracy, the metric for efficiency
varies across studies, making them hard to compare fairly. To that end,
this work presents ELUE (Efficient Language Understanding Evaluation), a
standard evaluation and public leaderboard for efficient NLP models.
ELUE is dedicated to depicting the
Pareto Frontier for various language understanding tasks, such that it
can tell whether and how much a method achieves Pareto improvement.
Along with the benchmark, we also release a strong baseline,
ElasticBERT, which allows BERT to exit at any layer in both static and
dynamic ways. We demonstrate that ElasticBERT, despite its simplicity,
outperforms or performs on par with SOTA compressed and early exiting
models. With ElasticBERT, the proposed ELUE has a strong Pareto Frontier
and makes a better evaluation for efficient NLP models.
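The Pareto comparison behind ELUE can be illustrated with a small helper that keeps only the submissions not dominated in both accuracy and FLOPs (the tuple format is assumed):
```python
def pareto_frontier(entries):
    """Keep entries not dominated by any other on (higher accuracy, lower FLOPs).
    entries: list of (name, accuracy, flops) tuples (illustrative format)."""
    frontier = []
    for name, acc, flops in entries:
        dominated = any(a >= acc and f <= flops and (a > acc or f < flops)
                        for _, a, f in entries)
        if not dominated:
            frontier.append((name, acc, flops))
    return sorted(frontier, key=lambda e: e[2])          # order by compute cost

# Example: the smaller model dominates when it is both cheaper and at least as accurate.
print(pareto_frontier([("model-a", 0.84, 4.0e9), ("model-b", 0.84, 2.5e9)]))
```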
-
A Simple Hash-Based Early Exiting Approach For Language
Understanding and Generation
Tianxiang Sun, Xiangyang Liu, Wei Zhu, Zhichao Geng,
Lingling Wu, Yilong He, Yuan Ni, Guotong Xie, Xuanjing Huang, Xipeng Qiu
ACL (Findings) 2022
pdf
/
abstract
/
code
/
slides
Early exiting allows instances to exit at different layers according to
an estimate of their difficulty. Previous work usually adopts heuristic
metrics such as the entropy of internal outputs to measure instance
difficulty, which suffer from poor generalization and require careful
threshold tuning. In contrast, learning to exit, i.e., learning to
predict instance difficulty, is a more appealing approach. Though some
effort has been devoted to
employing such “learn-to-exit” modules, it is still unknown whether and
how well the instance difficulty can be learned. As a response, we first
conduct experiments on the learnability of instance difficulty, which
demonstrates that modern neural models perform poorly on predicting
instance difficulty. Based on this observation, we propose a
simple yet effective Hash-based Early Exiting approach (HashEE) that
replaces the learn-to-exit modules with hash functions to assign each
token to a fixed exiting layer. Different from previous methods, HashEE
requires neither internal classifiers nor extra parameters, and is
therefore more efficient. HashEE can be used in various tasks (including language
understanding and generation) and model architectures such as seq2seq
models. Experimental results on classification, regression, and
generation tasks demonstrate that HashEE can achieve higher performance
with fewer FLOPs and inference time compared with previous
state-of-the-art early exiting methods.
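The token-to-layer assignment needs no learned module; a simple modulo hash is enough to illustrate it (the paper studies several hash choices, e.g., frequency-based ones).
```python
import torch

def hash_exit_layers(token_ids, num_layers, min_exit_layer=1):
    """Map every token id to a fixed exiting layer with a simple hash (sketch).
    Tokens assigned to layer L stop being updated after layer L at inference."""
    buckets = num_layers - min_exit_layer
    return min_exit_layer + (token_ids % buckets)        # (batch, seq_len) exit layers

# Example: with 12 layers, token ids 0..4 are pinned to layers 1..5.
print(hash_exit_layers(torch.arange(5), num_layers=12))
```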
2021
-
Accelerating BERT Inference for Sequence Labeling via
Early-Exit
Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng
Qiu, Xuanjing Huang
ACL 2021
pdf
/
abstract
/
code
Both performance and efficiency are crucial factors for sequence
labeling tasks in many real-world scenarios. Although the pre-trained
models (PTMs) have significantly improved the performance of various
sequence labeling tasks, their computational cost is high. To
alleviate this problem, we extend the recent successful early-exit
mechanism to accelerate the inference of PTMs for sequence labeling
tasks. However, existing early-exit mechanisms are specifically designed
for sequence-level tasks, rather than sequence labeling. In this paper,
we first propose a simple extension of sentence-level early-exit for
sequence labeling tasks. To further reduce the computational cost, we
also propose a token-level early-exit mechanism that allows partial
tokens to exit early at different layers. Considering the local
dependencies inherent in sequence labeling, we employ a window-based
criterion to decide whether a token should exit. The token-level early
exit introduces a gap between training and inference, so we add an extra
self-sampling fine-tuning stage to alleviate it. Extensive experiments
on three popular sequence labeling tasks show that our approach can save
up to 66%–75% of the inference cost with minimal
performance degradation. Compared with competitive compressed models
such as DistilBERT, our approach can achieve better performance under
the same speed-up ratios of 2×, 3×, and 4×.
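The window-based criterion can be sketched as follows: a token exits at the current layer only if it and its neighbors within a window are all confident; the names and the exact rule below are assumptions.
```python
import torch

@torch.no_grad()
def window_exit_mask(token_probs, threshold=0.9, window=1):
    """Decide which tokens may exit at this layer (illustrative sketch).
    token_probs: (seq_len, num_labels) softmax outputs of the current internal classifier."""
    conf = token_probs.max(dim=-1).values                # (seq_len,) per-token confidence
    confident = conf >= threshold
    exit_mask = confident.clone()
    for offset in range(1, window + 1):
        # A token exits only if its neighbors inside the window are also confident,
        # respecting the local dependencies of sequence labeling.
        exit_mask[offset:] &= confident[:-offset]
        exit_mask[:-offset] &= confident[offset:]
    return exit_mask                                     # True = token halts at this layer
```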
-
Does Syntax Matter? A Strong Baseline for Aspect-Based
Sentiment Analysis with RoBERTa
Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, Xipeng Qiu
NAACL 2021
pdf
/
abstract
/
code
/
slides
Aspect-based Sentiment Analysis (ABSA), aiming at predicting the
polarities for aspects, is a fine-grained task in the field of sentiment
analysis. Previous work showed syntactic information, e.g. dependency
trees, can effectively improve the ABSA performance. Recently,
pre-trained models (PTMs) also have shown their effectiveness on ABSA.
Therefore, the question naturally arises whether PTMs contain sufficient
syntactic information for ABSA so that we can obtain a good ABSA model
only based on PTMs. In this paper, we first compare the induced trees
from PTMs and the dependency parsing trees on several popular models for
the ABSA task, showing that the induced tree from fine-tuned RoBERTa
(FT-RoBERTa) outperforms the parser-provided tree. Further analysis
reveals that the FT-RoBERTa induced tree is more
sentiment-word-oriented and benefits the ABSA task. The experiments
also show that the pure RoBERTa-based model can outperform or
approach the previous SOTA performance on six datasets across
four languages, since it implicitly incorporates the task-oriented
syntactic information.
2020
-
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai,
Xuanjing Huang
SCIENCE CHINA Technological Sciences 2020  
(Invited Paper, Most Influential Paper of SCTS in 2020)
pdf
/
abstract
Recently, the emergence of pre-trained models (PTMs) has brought natural
language processing (NLP) to a new era. In this survey, we provide a
comprehensive review of PTMs for NLP. We first briefly introduce
language representation learning and its research progress. Then we
systematically categorize existing PTMs based on a taxonomy from four
different perspectives. Next, we describe how to adapt the knowledge of
PTMs to downstream tasks. Finally, we outline some potential directions
of PTMs for future research. This survey is intended to serve as a hands-on
guide for understanding, using, and developing PTMs for various NLP
tasks.
-
CoLAKE: Contextualized Language and Knowledge Embedding
Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru
Hu, Xuanjing Huang, Zheng
Zhang
COLING 2020
pdf
/
abstract
/
code
/
slides
Among emerging approaches that incorporate factual knowledge into
pre-trained language models such as BERT, most existing models use
shallow, static, and separately pre-trained entity embeddings, which
limit the performance gains of these models. Few works explore the
potential of deep contextualized knowledge representation when injecting
knowledge. In this paper, we propose the Contextualized Language and
Knowledge Embedding (CoLAKE), which jointly learns contextualized
representation for both language and knowledge with the extended MLM
objective. Instead of injecting only entity embeddings, CoLAKE extracts
the knowledge context of an entity from large-scale knowledge bases. To
handle the heterogeneity of knowledge context and language context, we
integrate them in a unified data structure, word-knowledge graph (WK
graph). CoLAKE is pre-trained on large-scale WK graphs with the modified
Transformer encoder. We conduct experiments on knowledge-driven tasks,
knowledge probing tasks, and language understanding tasks. Experimental
results show that CoLAKE outperforms previous counterparts on most of
the tasks. Besides, CoLAKE achieves surprisingly high performance on our
synthetic task called word-knowledge graph completion, which shows the
superiority of simultaneously contextualizing language and knowledge
representation.
-
Learning Sparse Sharing Architectures for Multiple Tasks
Tianxiang Sun*, Yunfan Shao*, Xiaonan Li, Pengfei Liu, Hang
Yan, Xipeng Qiu, Xuanjing
Huang
AAAI 2020 (Oral Presentation)
pdf
/
abstract
/
code
/
slides
Most existing deep multi-task learning models are based on parameter
sharing, such as hard sharing, hierarchical sharing, and soft sharing.
Choosing a suitable sharing mechanism depends on the relations among
the tasks, which are hard to determine since the underlying shared
factors among these tasks are difficult to understand. In this paper, we propose a
novel parameter sharing mechanism, named Sparse Sharing. Given multiple
tasks, our approach automatically finds a sparse sharing structure. We
start with an over-parameterized base network, from which each task
extracts a subnetwork. The subnetworks of multiple tasks are partially
overlapped and trained in parallel. We show that both hard sharing and
hierarchical sharing can be formulated as particular instances of the
sparse sharing framework. We conduct extensive experiments on three
sequence labeling tasks. Compared with single-task models and three
typical multi-task learning baselines, our proposed approach achieves
consistent improvement while requiring fewer parameters.
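Sparse sharing can be pictured as task-specific binary masks over one shared base network, so each task trains only its own subnetwork; the random masks below stand in for masks found by per-task pruning and are not the released implementation.
```python
import torch
import torch.nn as nn

class SparseSharedLinear(nn.Module):
    """One shared weight matrix; each task uses its own binary mask over it (sketch)."""
    def __init__(self, in_dim, out_dim, num_tasks, keep_prob=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim))
        nn.init.xavier_uniform_(self.weight)
        # Fixed per-task masks; in practice they would come from per-task pruning.
        masks = (torch.rand(num_tasks, out_dim, in_dim) < keep_prob).float()
        self.register_buffer("masks", masks)

    def forward(self, x, task_id):
        # Only the unmasked (shared) weights participate in this task's forward/backward.
        return x @ (self.weight * self.masks[task_id]).t()
```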