RNN-Transducer Model for Nested Entity Recognition in Biomedical Literature

In Progress, Posted 2 days ago, Paid on delivery

Project Overview:

This project involves developing and training a Recurrent Neural Network Transducer (RNN-T) model for nested named entity recognition (nested NER) in biomedical literature. Nested NER identifies and classifies entity mentions that may overlap or be contained within one another, such as genes, proteins, diseases, and drugs.

Key Responsibilities:

Data Preparation:

Data Collection: Gather and curate a high-quality dataset of biomedical literature annotated with nested entity labels.

Data Preprocessing:

Clean and preprocess the text data, including tokenization, sentence splitting, and handling special characters.

Implement appropriate data augmentation techniques to enhance model robustness and prevent overfitting.

Create suitable input and output representations for the RNN-T model, considering factors like character-level or word-level embeddings.
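One possible output representation, sketched below, linearizes nested annotations into a single symbol sequence by wrapping each entity in typed open and close markers. This is an illustration, not a format the posting prescribes: the `Span` type, the `<LABEL>` / `</LABEL>` marker syntax, and the function names are all assumptions. It also assumes spans are properly nested (no crossing spans) and that no token looks like a marker.

```python
from typing import Dict, List, Tuple

# Hypothetical span type: (start, end exclusive, entity type), in token indices.
Span = Tuple[int, int, str]


def linearize(tokens: List[str], spans: List[Span]) -> List[str]:
    """Interleave tokens with typed open/close markers so nesting is recoverable."""
    opens: Dict[int, List[Tuple[int, str]]] = {}
    closes: Dict[int, List[Tuple[int, str]]] = {}
    for start, end, label in spans:
        opens.setdefault(start, []).append((end, label))
        closes.setdefault(end, []).append((start, label))

    out: List[str] = []
    for i in range(len(tokens) + 1):
        # Close inner spans first: the span that started latest closes first.
        for _, label in sorted(closes.get(i, []), reverse=True):
            out.append(f"</{label}>")
        # Open outer spans first: the span that ends latest opens first.
        for _, label in sorted(opens.get(i, []), key=lambda s: (-s[0], s[1])):
            out.append(f"<{label}>")
        if i < len(tokens):
            out.append(tokens[i])
    return out


def delinearize(symbols: List[str]) -> Tuple[List[str], List[Span]]:
    """Invert linearize(): recover tokens and spans using a stack of open markers."""
    tokens: List[str] = []
    spans: List[Span] = []
    stack: List[Tuple[str, int]] = []
    for sym in symbols:
        if sym.startswith("</") and sym.endswith(">"):
            label, start = stack.pop()
            spans.append((start, len(tokens), label))
        elif sym.startswith("<") and sym.endswith(">"):
            stack.append((sym[1:-1], len(tokens)))
        else:
            tokens.append(sym)
    return tokens, spans


tokens = ["IL-2", "gene", "expression"]
spans = [(0, 2, "DNA"), (0, 1, "PROTEIN")]  # "IL-2" is nested inside "IL-2 gene"
print(linearize(tokens, spans))
# ['<DNA>', '<PROTEIN>', 'IL-2', '</PROTEIN>', 'gene', '</DNA>', 'expression']
```

A bracketed sequence like this keeps the input tokens and the nesting structure in one stream. The trade-off is that crossing spans cannot be encoded and would need a different scheme.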

Model Development:

Architecture: Design and implement an RNN-T architecture suitable for nested entity recognition, potentially incorporating:

Encoder: Bidirectional LSTMs or GRUs for capturing contextual information.

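Prediction Network: An autoregressive network (e.g., an LSTM) over previously emitted labels, acting as a label-level language model. A joint network then combines the encoder and prediction-network outputs into a distribution over the next label or a blank symbol.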

Attention Mechanisms: To improve focus on relevant parts of the input sequence.
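The components listed above might be wired together roughly as follows in PyTorch. This is a minimal sketch under stated assumptions, not the required design: the class name `TransducerTagger`, the layer sizes, and the choice of label index 0 as the blank/start symbol are all hypothetical, and the optional attention mechanism is left out for brevity.

```python
import torch
import torch.nn as nn


class TransducerTagger(nn.Module):
    """Hypothetical RNN-T skeleton: BiLSTM encoder, LSTM prediction network, joint network."""

    def __init__(self, vocab_size: int, num_labels: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        # Assumption: label index 0 is reserved for blank and doubles as the start symbol.
        self.blank = 0
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.lab_emb = nn.Embedding(num_labels, emb_dim)
        self.predictor = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        self.pred_proj = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, tokens: torch.Tensor, prev_labels: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) input token ids.
        # prev_labels: (B, U + 1) start symbol followed by the U target labels.
        enc, _ = self.encoder(self.tok_emb(tokens))            # (B, T, 2 * hidden)
        pred, _ = self.predictor(self.lab_emb(prev_labels))    # (B, U + 1, hidden)
        # Joint network: broadcast-add every encoder step with every prediction step.
        joint = torch.tanh(
            self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
        )                                                      # (B, T, U + 1, hidden)
        return self.out(joint).log_softmax(dim=-1)             # (B, T, U + 1, num_labels)


model = TransducerTagger(vocab_size=100, num_labels=5)
tokens = torch.randint(0, 100, (2, 7))          # batch of 2 sentences, 7 tokens each
prev_labels = torch.zeros(2, 4, dtype=torch.long)  # start symbol + 3 target labels
log_probs = model(tokens, prev_labels)
print(log_probs.shape)  # torch.Size([2, 7, 4, 5])
```

The output is a lattice of log-probabilities over every (input position, labels emitted so far) pair, which is the form a transducer loss consumes.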

Training: Train the model using an appropriate optimization algorithm (e.g., Adam) and loss function (e.g., the RNN-T transducer loss, with Connectionist Temporal Classification (CTC) loss as a simpler alternative).
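One candidate training objective for a transducer model is the RNN-T negative log-likelihood, which sums the probability of every alignment between the input positions and the target labels. The pure-Python sketch below shows the forward recursion for a single example. The function name `transducer_nll` and the convention that index 0 is blank are assumptions made for illustration. In practice a batched library implementation (torchaudio provides an RNN-T loss, for instance) would normally be used instead.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_nll(
    log_probs: Sequence[Sequence[Sequence[float]]],
    targets: Sequence[int],
    blank: int = 0,
) -> float:
    """Negative log-likelihood of `targets` under a transducer lattice.

    log_probs[t][u][k] is the log-probability of symbol k at input position t
    after u target labels have been emitted; its shape is T x (U + 1) x K.
    """
    T, U = len(log_probs), len(targets)
    # alpha[t][u]: log-probability of reaching position t with u labels emitted.
    alpha: List[List[float]] = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at (t - 1, u), which advances the input.
            stay = alpha[t - 1][u] + log_probs[t - 1][u][blank] if t > 0 else NEG_INF
            # Arrive by emitting the u-th target label at (t, u - 1).
            emit = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]] if u > 0 else NEG_INF
            alpha[t][u] = log_add(stay, emit)
    # Every alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


# Uniform distribution over K = 2 symbols, T = 2 input positions, U = 1 target label.
uniform = [[[math.log(0.5)] * 2 for _ in range(2)] for _ in range(2)]
print(transducer_nll(uniform, targets=[1]))  # two alignments of probability 1/8 each
```

In the uniform example the label can be emitted at either input position, giving two alignments of three symbols each, so the total probability is 1/4 and the loss is log 4.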

Hyperparameter Tuning: Conduct thorough hyperparameter tuning (e.g., learning rate, dropout rate, hidden layer sizes) to optimize model performance.

Evaluation:

Metrics: Evaluate model performance using relevant metrics for nested NER, such as:

F1-score: Micro-averaged over all entity types, as the headline measure of overall recognition quality.

Precision and Recall: For individual entity types.

Exact Match: A predicted entity counts as correct only if both its span boundaries and its entity type match the annotation exactly.
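The metrics above could be computed along the following lines, treating each sentence's entities as a set of `(start, end, label)` triples so that nested mentions are scored independently. This is a sketch under assumptions: the function names and the span format are hypothetical, and only exact matches on boundaries and type are counted as correct.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # hypothetical format: (start, end exclusive, entity type)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, returning 0.0 where undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold: List[Set[Span]], pred: List[Set[Span]]):
    """Exact-match scores per entity type, plus micro-averaged scores overall."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0, 0])  # [tp, fp, fn]
    for g, p in zip(gold, pred):
        for span in p:
            counts[span[2]][0 if span in g else 1] += 1
        for span in g - p:
            counts[span[2]][2] += 1
    per_type = {label: prf(*c) for label, c in counts.items()}
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return per_type, prf(*totals)


gold = [{(0, 2, "DNA"), (0, 1, "PROTEIN")}]
pred = [{(0, 2, "DNA"), (1, 2, "PROTEIN")}]  # the inner PROTEIN span is misplaced
per_type, micro = evaluate(gold, pred)
print(per_type["DNA"])      # (1.0, 1.0, 1.0)
print(per_type["PROTEIN"])  # (0.0, 0.0, 0.0)
print(micro)                # (0.5, 0.5, 0.5)
```

Scoring spans as sets means an outer entity can be credited even when the entity nested inside it is missed, which makes per-type breakdowns useful for error analysis.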

Analysis: Analyze model performance, identify areas for improvement, and generate insights into the challenges of nested NER in biomedical literature.

Documentation:

Code Documentation: Provide clear and concise documentation for all code, including comments, docstrings, and README files.

Project Report: Prepare a comprehensive report summarizing the project methodology, results, and findings.

Required Skills:

Strong proficiency in Python and deep learning frameworks (e.g., TensorFlow, PyTorch).

Solid understanding of Natural Language Processing (NLP) concepts, including tokenization, word embeddings, and sequence-to-sequence models.

Experience with RNN-T models or similar sequence-to-sequence architectures.

Familiarity with nested entity recognition and its challenges.

Experience with data preprocessing, feature engineering, and model evaluation.

Excellent communication and documentation skills.

Deliverables:

Trained RNN-T model: A well-trained and optimized model for nested entity recognition in biomedical literature.

Code: Clean, well-documented, and reproducible code for all stages of the project.

Data: Preprocessed and annotated datasets used for model training and evaluation.

Project Report: A comprehensive report detailing the project methodology, results, and findings.

Skills: Deep Learning, NLP, Tokenization, Python, PyTorch, TensorFlow

Project ID: #38951454

About the project

5 proposals, Remote project, Active 2 days ago