Data Quality and Cleaning in Big Data
About the Project
The proliferation of Big Data analytics is driving the development of ever more sophisticated models for intelligent, data-driven insight and decision making in business and in other areas such as health and social care. Machine learning (ML) plays a crucial role in this development. On the one hand, the performance of an ML model depends heavily on the quality of the data used to train it; on the other hand, ML techniques can themselves be used for data cleaning. However, critical issues concerning the data quality required for these models to be effective and trustworthy are not getting the attention they deserve. This project will investigate data quality and data cleaning in Big Data using ML techniques, including deep learning methods, focusing on how the characteristics of Big Data affect the suitability of existing data quality and data cleaning approaches. The main aim of the project is to develop a novel ML-based data cleaning framework that can be used in Big Data applications. To achieve this aim, the following objectives need to be met:
- Undertake research into current data quality and data cleaning techniques and methods, especially ML-based approaches;
- Investigate and evaluate existing data cleaning frameworks, especially ML-based frameworks;
- Develop a novel ML-based framework for data cleaning, including deep learning methods;
- Evaluate the proposed framework.
The application area, such as banking, retail, manufacturing, the Internet of Things, or health and social care, will be determined by the successful candidate in conversation with the supervisors.
Prospective applicants are encouraged to contact the supervisor before submitting their applications. Applications should clearly state the project being applied for and the name of the supervisor(s).
Academic qualifications
A first degree (at least a 2.2), ideally in Mathematics or Computing, with a good fundamental knowledge of Data Science and Algorithms, and ideally Machine Learning.
English language requirement
IELTS score must be at least 6.5 (with not less than 6.0 in each of the four components). Other, equivalent qualifications will be accepted. Full details of the University’s policy are available online.
Essential attributes:
- Experience in fundamental database applications
- Competence in data structures and algorithms
- Knowledge of data science
- Good written and oral communication skills
- Strong motivation, with evidence of independent research skills relevant to the research project
- Good time management skills
Desirable attributes:
- A basic understanding of data quality and data cleaning would be beneficial.
APPLICATION CHECKLIST
- Completed application form
- CV
- 2 academic references, using the Postgraduate Educational Reference Form
- Research project outline of 2 pages (list of references excluded). The outline may provide details about:
  - Background and motivation of the project. The motivation, explaining the importance of the project, should also be supported by relevant literature. You can also discuss the applications you expect for the project results.
  - Research questions or objectives.
  - Methodology: types of data to be used, approach to data collection, and data analysis methods.
  - List of references.
The outline must be created solely by the applicant. Supervisors can only offer general discussions about the project idea without providing any additional support.
- Statement no longer than 1 page describing your motivations and fit with the project.
- Evidence of proficiency in English (if appropriate)
To be considered, the application must use the advertised title as the project title.
For informal enquiries about this PhD project, please contact Dr Taoxin Peng - t.peng@napier.ac.uk
References
C. Batini and M. Scannapieco, Data and Information Quality: Dimensions, Principles and Techniques, Springer, 2016.
H. Liu, A. Kumar T.K., J. P. Thomas and X. Hou, Cleaning Framework for BigData: An Interactive Approach for Data Cleaning, 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016, pp. 174-181, doi: 10.1109/BigDataService.2016.41.
F. Ridzuan and Z. Wan, A review on data cleansing methods for big data. Procedia Comput Sci. 2019, doi.org/10.1016/j.procs.2019.11.177
Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X. and Zhang, C., 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]. arXiv preprint arXiv:1904.09483.
Neutatz, F., Chen, B., Abedjan, Z. and Wu, E., 2021. From Cleaning before ML to Cleaning for ML. IEEE Data Eng. Bull., 44(1), 24-41.
Côté, P.O., Nikanjam, A., Ahmed, N., Humeniuk, D. and Khomh, F., 2024. Data cleaning and machine learning: a systematic literature review. Automated Software Engineering, 31(2), 54


