Job Title
Technical Lead - PySpark

Responsibilities
• Design, develop, and maintain scalable data pipelines using PySpark and related big data technologies.
• Work with large datasets and develop data models for consumption by data scientists and analysts.
• Optimize Spark jobs for better performance and resource management.
• Design and implement data integration workflows between various data sources.
• Troubleshoot and resolve issues related to data pipelines.
• Collaborate with cross-functional teams to understand business requirements and deliver solutions.
• Ensure data quality and cleanliness using validation and transformation techniques.
• Write and maintain efficient, scalable code in Python and PySpark.
• Manage data storage, computation, and scaling on cloud platforms such as AWS or Azure.

Requirements
Must Have:
• Bachelor's degree in Computer Science, Engineering, or a related field.
• Minimum 5 years of experience with PySpark for data processing and manipulation on large-scale datasets.
• Solid understanding of Spark architecture, including RDDs, DataFrames, and Datasets.
• Strong programming experience in Python, including libraries such as pandas, NumPy, and Matplotlib.
• Experience with Hadoop, Hive, and NoSQL databases (e.g., Cassandra, MongoDB).
• Working knowledge of cloud computing services (e.g., AWS, Azure, or Google Cloud).
• Familiarity with batch and stream processing (using Kafka, Flink, or Spark Streaming).
• Strong problem-solving skills and attention to detail.
• Excellent communication and teamwork skills.

Good to Have:
• Experience with Apache Airflow or other orchestration tools for managing workflows.
• Familiarity with Docker or Kubernetes for containerized data environments.
• Experience implementing and managing Continuous Integration (CI) and Continuous Delivery (CD) pipelines, with a focus on automating, testing, and deploying code in a fast-paced development environment.