You will be part of a team designing and building a Gen AI virtual agent to support customers and employees across multiple channels. You will build and run LLM-powered agentic experiences, owning the design, orchestration, MLOps, and continuous improvement.

- Design and build client-specific GenAI/LLM virtual agents
- Enable the orchestration, management, and execution of AI-powered interactions through purpose-built AI agents
- Design, build and maintain robust LLM-powered processing workflows
- Develop cutting-edge testing suites for bespoke LLM performance metrics
- CI/CD for ML/LLM: automated build/train/validate/deploy pipelines for chatbots and agent services
- IaC: Infrastructure as Code (Terraform/CloudFormation) to provision scalable cloud infrastructure for training and real-time inference
- Observability: monitoring, drift detection, hallucination detection, SLOs, and alerting for model and service health
- Serving at scale: containerised, auto-scaling (e.g., Kubernetes) with low-latency inference
- Data and model versioning; maintain a central model registry with lineage and rollback
- Deliver a live performance dashboard (intent accuracy, latency, error rates) and a documented retraining strategy
- Lead and foster creativity around frameworks/models; collaborate closely with product, engineering, and client stakeholders

Qualifications / Experience

- Relevant primary level degree and ideally MSc or PhD
- Proven expertise in mathematics and classical ML algorithms, plus deep knowledge of LLMs (prompting, fine-tuning, RAG/tool use, evaluation)
- Hands-on with AWS and Azure services for data/ML (e.g., Bedrock/SageMaker, Azure OpenAI/Azure ML)
- Strong engineering: Python, APIs, containers, Git; CI/CD (GitHub Actions/Azure DevOps), IaC (Terraform/CloudFormation)
- Scalable Serving Infrastructure: A containerized, auto-scaling environment (e.g., using Kubernetes) to serve the chatbot model with low latency
- Workflow Automation: Automate the end-to-end machine learning lifecycle, from data ingestion and preprocessing to model retraining and deployment
- Live Performance Dashboard: A real-time dashboard displaying key model metrics such as intent accuracy, response latency, and error rates
- Centralized Model Registry: A versioned repository for all trained models, their performance metrics, and associated training data
- Documented Retraining Strategy: An automated workflow and documentation outlining the process for periodically retraining the model on new data
- Experience with Kubernetes, inference optimisation, caching, vector stores, and model registries
- Clear communication, stakeholder management, and a habit of writing crisp technical docs and runbooks

Personal Attributes

Personal Integrity, Stakeholder Management, Project Management, Agile Methodologies, Automation, Data Visualisation and ...