Shagun Dwivedi

Updates

Jan 2026
HTR of Grantha Manuscripts
I'm exploring OCR methods such as Vision Transformers and CRNNs to improve recognition accuracy. I'm compiling some insights about the variations in the script here.
Sep 2025
Built the Legislative Analysis Application
A legislative document research tool that provides real-time document analysis with visual relationship mapping, built on an MCP architecture and vector databases. Check out the demo.
Aug 2025
Check out Entia
Your go-to Express Add-on for personalized video analytics and persona-based feedback. We won the Best New Video Innovation category in the Adobe Express Add-ons Hackathon!

Hello! I'm Shagun Dwivedi

I'm an AI Researcher with a passion for data, writing systems, languages, and accuracy. I currently work at the Centre for Interdisciplinary AI, FLAME University, Pune, India. I hold a BS in Data Science and Applications from IIT Madras and a BA (Hons) in Ancient Indian History and Archaeology from the University of Lucknow.

I'm interested in natural language processing, the use of language models in machine perception, and evaluation benchmarks and metrics for LLMs.

Education

2025
BS in Data Science and Applications, IIT Madras
CGPA: 9.04
2023
BA (Hons.) in Ancient Indian History and Archaeology, University of Lucknow
CGPA: 8.02
2020
Class XII (PCM with CS), Jagran Public School, Lucknow
Percentage: 94%
2018
Class X, Jagran Public School, Lucknow
Percentage: 91.4%

Check out some of my projects

I have primarily worked on:

Natural Language Processing, Optical Character Recognition, Web Development

Analysis of different Tokenizers and their effect on Downstream tasks for Hindi and Marathi

Evaluation of various subword tokenization methods and their impact on language modeling performance for Hindi and Marathi.

HuggingFace Transformers, LLM Evaluation, Question Answering, Grapheme to Phoneme

Proposing a tokenization method that could reduce sub-optimal tokenization, in particular the over-segmentation of text in Indic languages, and its effect on language modeling (using T5).

MANUSCRIPT IN REVIEW

Downscaling Terrestrial Water Storage Anomalies

An algorithm that produces higher-spatial-resolution TWSA from 6 months of precipitation and temperature data.

Remote Sensing, Machine Learning, Statistics

Implemented a neighborhood-weighted iterative optimization algorithm to downscale TWSA from 3°×3° to 0.1°×0.1° resolution, achieving better RMSE and spatial frequency scores than GLDAS estimates.

MANUSCRIPT IN PREPARATION

Converting Custom-Embedded Subsetted Non-Unicode Fonts to Searchable Formats

Extracting Indic text from non-Unicode standard PDFs with embedded fonts to a searchable and Unicode-compliant format.

Computational Linguistics, Optical Character Recognition, Low Resource Scripts

Expanding on a previously published case study in Gujarati by including Hindi, Marathi, and Bangla documents.

Handwritten Text Recognition for Pre-Colonial Sanskrit manuscripts

Digitizing manuscripts on Vedanta and Mimansa to make them searchable.

Optical Character Recognition, Finetuning, Evaluation Metrics

Finetuned a ResNet-BiLSTM-CTC model for handwritten manuscripts that outperforms out-of-the-box HTR models on grapheme cluster error rate and character error rate.


AI Language Practice Applications for French and German

Practice platform for beginner French 101 and German 101 courses, customised to university course syllabi.

Machine Learning, Python, Scikit-Learn

Designed a multi-agent system, supported by custom prompt-engineered OpenAI bots, for the dialog framework. Presented at AIET 2025.


Customer Behavior Prediction

Predicting user behaviour based on the restaurant's offers.

Machine Learning, Python, Scikit-Learn

Performed preprocessing; trained, validated, and evaluated multiple supervised classification models; and selected the best prediction model.


Indian Weather API

API for India Meteorological Department Weather Data

RESTful API, Web Scraping, Flask-Restful

Access a brief or detailed weather report by station name. Get the forecast for the next seven days.


AirIndia Case Study

Analysis and formulation of growth strategies for AirIndia after its acquisition by the TATA Group.

Market Research, Analysis, Strategy Development

Case study ranked first in Winter Consulting'22, IIT Guwahati. Conducted extensive market research and formulated strategies for the airline's improvement based on the growth strategy framework.


TrackOn App

Flask-based web application for tracking habits, activities, and other life parameters.

Flask, SQLite, VueJs, Redis, Celery

TrackOn is a web application for tracking habits, activities, and other life parameters. Users can register and log in to create multiple trackers, each with multiple logs. They can review their progress over time with graphs and trend lines.


Publications

A Case Study of Handwritten Text Recognition from Pre-Colonial era Sanskrit Manuscripts

Kartik Chincholikar, Shagun Dwivedi, Kaushik Gopalan, Tarinee Awasthi
World Sanskrit Conference, 2025
This paper presents a comprehensive study on applying handwritten text recognition techniques to pre-colonial era Sanskrit manuscripts. We discuss the challenges unique to these historical documents and our approach to achieving high accuracy in recognition.

A Semi-Automatic Text Recognition Tool for Pre-Colonial Handwritten Manuscripts in Devanāgari Script

Bharath Valaboju, Shagun Dwivedi, Kartik Chincholikar, Kaushik Gopalan, Vinod Vidwans
International Conference on Human-Computer Interaction, 2025
This poster presents an annotation tool that lets the user extract text from undigitized manuscripts using OCR and then correct the OCR-detected text.

Converting Gujarati Text in Custom-Embedded Subsetted Non-unicode Fonts to Searchable Formats: A Case Study Using Jain Religious Texts

Rishav Jain, Shagun Dwivedi, Kaushik Gopalan
International Conference on Computer Vision and Image Processing, 2024
This study presents a novel approach for extracting Gujarati text from non-Unicode standard PDFs with embedded fonts, addressing the challenges posed by these legacy fonts and providing a pathway to convert and preserve such documents in a searchable, Unicode-compliant format.

Contact Me: