Previous 3D human pose generation methods typically struggle with semantic mismatches or missing elements when handling detailed text descriptions, hindering applications that demand precise control from detailed textual inputs. In this work, we introduce the Semantic Mask Transformer (SMT), a text-driven animation pipeline for synthesizing 3D poses that are semantically consistent with detailed text descriptions. We leverage the semantic knowledge of a GPT language model together with a masked training objective to enhance local body-part semantic consistency in the Mask Transformer. Specifically, SMT comprises two main components: a Body-Part Group Residual Vector Quantization Autoencoder, which explicitly quantizes human poses into body-part tokens, and a Mask Transformer, which predicts masked body-part tokens conditioned on the text description. Experimental results demonstrate that the proposed method produces high-quality poses while accurately preserving the semantics of the input description.
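
To make the tokenization step concrete, the sketch below shows one plausible reading of body-part group residual vector quantization: each body-part group is encoded separately and discretized by a stack of residual codebooks, yielding per-part token ids that a masked transformer could later predict. This is a minimal sketch, not the paper's implementation; all module names, dimensions, and the particular six-way body partition are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Quantize a latent vector with a stack of codebooks: each layer
    quantizes the residual left by the previous one (sizes are assumed)."""
    def __init__(self, num_layers=2, codebook_size=512, dim=64):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers)
        )

    def forward(self, z):                      # z: (batch, dim)
        residual, quantized, tokens = z, 0.0, []
        for codebook in self.codebooks:
            # Nearest codebook entry for the current residual.
            dists = torch.cdist(residual, codebook.weight)  # (batch, K)
            idx = dists.argmin(dim=-1)                      # token ids
            code = codebook(idx)
            quantized = quantized + code
            residual = residual - code
            tokens.append(idx)
        # Straight-through estimator so gradients reach the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(tokens, dim=-1)       # ids per layer

class BodyPartGroupRVQ(nn.Module):
    """One encoder/quantizer pair per body-part group, so each part of
    the pose maps to its own discrete token sequence."""
    def __init__(self, part_dims, dim=64):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, dim) for d in part_dims)
        self.quantizers = nn.ModuleList(ResidualVQ(dim=dim) for _ in part_dims)

    def forward(self, parts):                  # list of (batch, part_dim)
        return [vq(enc(x))
                for x, enc, vq in zip(parts, self.encoders, self.quantizers)]

# Usage: split a pose vector into (hypothetical) part groups and tokenize.
part_dims = (12, 12, 9, 9, 6, 6)               # e.g. arms, legs, torso, head
parts = [torch.randn(4, d) for d in part_dims]
model = BodyPartGroupRVQ(part_dims)
for quantized, tokens in model(parts):
    print(quantized.shape, tokens.shape)       # (4, 64) (4, 2)
```

Stacking codebooks lets later layers refine the residual error of earlier ones, so a short token sequence can represent each body part with higher fidelity than a single codebook of the same total size.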